EXAM SOLUTIONS. Computer Vision Course 2D1420, Thursday, 11th of March 2003, 8.00-13.00


Numerical Analysis and Computer Science, KTH
Danica Kragic

EXAM SOLUTIONS
Computer Vision Course 2D1420
Thursday, 11th of March 2003, 8.00-13.00

Exercise 1 (5*2=10 credits)

Answer at most 5 of the following questions. If you answer more than five, only the first five answers will be considered and the rest ignored.

(1) What is the fovea and what is its function? Describe what cones and rods are.

Fovea: a shallow pit in the retina, close to the blind spot and a fraction of a mm in diameter, with a high concentration of photoreceptors. It supports high-resolution vision in bright light; the gaze is directed so that the point of interest projects onto the fovea.

Cones (tappar): used for photopic (light) vision, concentrated in the fovea centralis of the macula, three colour types - red, green, blue - require bright light, relatively sparse outside the fovea, wavelength specific.

Rods (stavar): used for scotopic (dark) vision, distributed over the entire retina with a peak at about 20 degrees from the fovea centralis, similar sensitivity with respect to colour (not wavelength specific), very sensitive to light (down to the quantum level). Each eye contains about 120 million rods and 5 million cones.

(2) Explain the terms radiance and irradiance.

Radiance: amount of light radiated from a surface into a given solid angle, per unit area. The area is the foreshortened area, as seen from the direction in which the light is emitted.

Irradiance: light power per unit area (watts per square meter) incident on a surface. If the surface tilts away from the light, the same amount of light strikes a bigger surface - foreshortening (less irradiance).

(3) Explain the terms sampling and quantization.

Sampling: selection of a discrete grid to represent an image; spatial discretization of an image.

Quantization: mapping of the brightness into a numerical value; assigning a physical measurement to one of a discrete set of points in a range.
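A minimal numerical illustration of sampling and quantization may make question (3) concrete. The following Python/NumPy sketch is not part of the exam; the test pattern, the grid size N and the number of grey levels L are arbitrary assumed values:

    import numpy as np

    # "Continuous" brightness pattern, here an arbitrary function of image coordinates.
    f = lambda x, y: 0.5 + 0.5 * np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)

    # Sampling: evaluate the pattern on a discrete grid (spatial discretization).
    N = 8                                              # assumed grid size
    xs, ys = np.meshgrid(np.linspace(0, 1, N), np.linspace(0, 1, N))
    samples = f(xs, ys)                                # real-valued samples in [0, 1]

    # Quantization: map each brightness value to one of L discrete levels.
    L = 256                                            # assumed number of grey levels
    quantized = np.round(samples * (L - 1)).astype(np.uint8)

    print(samples[0, :3], quantized[0, :3])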

(4) What are common distance measures between image points? Give at least two examples.

Common distance measures, for points p = (x, y) and q = (u, v) (see the sketch at the end of this exercise):

Euclidean distance: $d(p,q) = \sqrt{(x-u)^2 + (y-v)^2}$

City block distance: $d(p,q) = |x-u| + |y-v|$

Chessboard distance: $d(p,q) = \max(|x-u|, |y-v|)$

(5) What is meant by contrast reversal in terms of grey-level transformations? Draw the corresponding linear transformation.

Contrast reversal - basically inverting the grey-level values: for an image g with grey levels normalized to $[0,1]$, $f(x) = 1 - g(x)$, i.e. a linear grey-level mapping with slope $-1$.

(6) How is the Fourier transform of a signal affected by a) translation and b) mirroring of the original signal in the spatial domain?

Translation affects only the phase, while the power spectrum is translation invariant. Mirroring in the spatial domain corresponds to mirroring in the spectral domain.

(7) What is meant by an ideal low-pass filter? Is this filter suitable to use in image processing? If yes, give an example of its application. If no, explain why.

The ideal low-pass filter is a mathematically idealized version of a smoothing filter. In a frequency-domain representation of the image, all frequencies above a threshold F are discarded, i.e. the filter passes only low frequencies, so the image becomes blurred. This method of smoothing tends to create images with ringing at sharp boundaries in the picture, which is its drawback. The observation that the application of the low-pass filter is equivalent to the convolution of the image with the sinc function provides an explanation for this phenomenon. This filter cannot be implemented exactly in practice (its impulse response has infinite spatial support).

Ideal low-pass filter in 2D: $\hat h(\nu) = \mathrm{rect}\!\left(\frac{|\nu|}{2\nu_c}\right)$, where $|\nu| = \sqrt{\nu_1^2 + \nu_2^2}$ and $\nu_c$ is the cut-off frequency.

Impulse response: $h(x) = 2\pi\nu_c^2 \, \frac{J_1(2\pi\nu_c |x|)}{2\pi\nu_c |x|}$, where $J_1$ is the first-order Bessel function.

(8) In what cases is spectral filtering more appropriate than spatial filtering? Give examples.

1) If the noise is periodic, 2) if we want to filter the image using large kernels.

(9) What is meant by region growing? When is it commonly used?

Region growing - homogeneous regions grow in size by including similar neighbouring pixels; the final result does not necessarily need to cover the entire image. First, seed regions have to be extracted, and these seed regions are iteratively grown at their borders by accepting new pixels that are consistent with the pixels already contained in the region. After each iteration, the homogeneity value of the region has to be re-calculated using also the new pixels. The results of region growing depend heavily on a proper selection of seed points.

Usage: dividing image pixels into a set of classes based on local properties and features (classification), matching in stereo, split-and-merge transform, etc.

(10) What is a region adjacency graph? What does merging correspond to in terms of a region adjacency graph?

A graph whose nodes represent the regions of a segmentation and whose edges connect adjacent regions; it is used to store region attributes for every region - related to the level of homogeneity of the regions. (Image provided in lectures.) Merging two regions corresponds to graph (edge) contraction.
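The three distance measures in question (4) can be written out directly. A small Python sketch (the function names are mine, not from the course material):

    import numpy as np

    def euclidean(p, q):
        (x, y), (u, v) = p, q
        return np.sqrt((x - u) ** 2 + (y - v) ** 2)

    def city_block(p, q):
        (x, y), (u, v) = p, q
        return abs(x - u) + abs(y - v)

    def chessboard(p, q):
        (x, y), (u, v) = p, q
        return max(abs(x - u), abs(y - v))

    p, q = (2, 3), (5, 7)
    print(euclidean(p, q), city_block(p, q), chessboard(p, q))   # 5.0 7 4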

Exercise 2 (2+2+2+1=7 credits)

(a) Describe the basics of, and draw figures for, the perspective and orthographic camera models. Given a set of parallel lines in 3D, explain what the difference is in their image projections for the two models.

Perspective projection: $\frac{x}{f} = \frac{X}{Z}$, $\quad \frac{y}{f} = \frac{Y}{Z}$

Orthographic projection: $x = X$, $\quad y = Y$

For the orthographic camera model, parallel lines remain parallel; for a perspective camera model, the lines intersect in a vanishing point (illustrated in the sketch after this exercise).

(Figure: perspective projection of a scene point through the focal point onto the image plane, and orthographic projection by parallel rays onto the image plane.)

(b) Under what conditions will a set of parallel lines viewed with a pinhole camera have its vanishing point (in the image) at infinity?

When the lines lie in a plane parallel to the image plane.

(c) If the area of a planar square in the scene is A, what is its area in an image under orthographic projection? Give your answer in terms of any parameters necessary to define this relationship, specifying what each parameter means.

$A\cos\theta$, where $\theta$ is the angle between the surface normal and the optical axis.
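A small Python/NumPy sketch of the two projection models in (a) and (b); the focal length and the two parallel 3D lines are arbitrary example values. It shows that under perspective projection the image points of parallel lines approach a common vanishing point, while under orthographic projection they stay parallel:

    import numpy as np

    f = 1.0                                    # assumed focal length
    d = np.array([0.0, 0.0, 1.0])              # common 3D direction of the two lines
    p1 = np.array([1.0, 0.0, 2.0])             # a point on the first line
    p2 = np.array([2.0, 1.0, 2.0])             # a point on the second, parallel line

    def perspective(P):
        X, Y, Z = P
        return f * X / Z, f * Y / Z

    def orthographic(P):
        X, Y, Z = P
        return X, Y

    for t in [0.0, 10.0, 1000.0]:              # move along the lines
        print(perspective(p1 + t * d), perspective(p2 + t * d))
    # both sequences converge to the vanishing point f*(d_x/d_z, d_y/d_z) = (0, 0)

    print(orthographic(p1), orthographic(p1 + 10 * d))   # orthographic: unchanged along the line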

(d) What is the difference between an affine and a perspective camera model?

An affine camera can be seen as a linear model, while a perspective camera is a non-linear one. Affine projection preserves parallel lines, perspective projection does not. The affine camera is a good model when all the points lie on a relatively planar surface (small depth variation) and their distance from the camera is large. The affine camera has 8 parameters, the perspective camera 12 parameters.

Exercise 3 (2+3+2+2=9 credits)

(a) An image has been smoothed using the following kernel: $k\,[1\;5\;10\;10\;5\;1]$. Can repeated convolutions of an image with the kernel $\frac{1}{2}[1\;1]$ be used to obtain the same result as with the first kernel? If yes, how many convolutions are needed? If no, explain the reasons why. What should the constant k be so that the filter gain is equal to 1?

Yes: five convolutions with $\frac{1}{2}[1\;1]$ (equivalently, the kernel $[1\;1]$ convolved with itself four times) give the same filter, and $k = \frac{1}{32}$ since $1+5+10+10+5+1 = 32$ (verified numerically in the sketch at the end of this exercise).

(b) Show that the kernel $k\,[-1\;{-2}\;0\;2\;1]$ can be written as multiple convolutions of a low-pass filter $\frac{1}{2}[1\;1]$ and a high-pass filter $[1\;{-1}]$. What should k be in order for the kernel to be a good approximation of the first order derivative? What is the frequency response of this kernel?

Yes, since:

$[1\;1] * [1\;1] = [1\;2\;1]$

$[1\;2\;1] * [1\;1] = [1\;3\;3\;1]$

$[1\;3\;3\;1] * [1\;{-1}] = [1\;2\;0\;{-2}\;{-1}]$

It follows that $|k| = \frac{1}{8}$, because three convolutions with the normalized low-pass filter $\frac{1}{2}[1\;1]$ have been performed and the maximum value should not change. Also, to obtain the kernel $[-1\;{-2}\;0\;2\;1]$ the sign has to be negative, so $k = -\frac{1}{8}$.

Frequency response (according to lectures):

$-e^{-2i\omega} - 2e^{-i\omega} + 2e^{i\omega} + e^{2i\omega} = 4i\sin\omega + 2i\sin 2\omega = i\,(4\sin\omega + 2\sin 2\omega)$

(c) Give an example of a mean (unweighted averaging) filter. Is a mean filter in general separable? Why do we prefer separable filters?

Let us take the simplest example:

$\frac{1}{9}\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$

Since unweighted averaging is assumed, all elements in the matrix are equal. Such a filter is always separable, since its rank is equal to one. Separable filters are preferred since they can be implemented as repeated convolutions with one-dimensional kernels, which is computationally cheaper.

(d) Given an image below, before (left) and after (right) a smoothing filter was applied. The size of the filter is shown as a small square in the upper-left corner of the image (as you can see, its size is rather small compared to the image size). In your opinion, which one of the following filter types most likely produced the image on the right: 1) mean (unweighted averaging) filter, 2) median filter, or 3) Gaussian filter? Motivate your answer.

(Images provided in the exam.)

The median filter, since the thin line on the left of the image disappears - with either of the other filters it would remain and merely be blurred and extended.
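The kernel identities used in (a) and (b) and the separability claim in (c) are easy to verify numerically. A NumPy sketch (np.convolve performs full discrete convolution):

    import numpy as np

    # (a) five convolutions with [1, 1]/2 give the 6-tap binomial kernel divided by 32
    k = np.array([1.0])
    for _ in range(5):
        k = np.convolve(k, np.array([1.0, 1.0]) / 2)
    print(k * 32)        # [ 1.  5. 10. 10.  5.  1.]

    # (b) three low-pass kernels [1, 1]/2 followed by one high-pass kernel [1, -1]
    d = np.array([1.0])
    for _ in range(3):
        d = np.convolve(d, np.array([1.0, 1.0]) / 2)
    d = np.convolve(d, np.array([1.0, -1.0]))
    print(d * 8)         # [ 1.  2.  0. -2. -1.]  ->  d = (-1/8) * [-1, -2, 0, 2, 1], the kernel in (b)

    # (c) an unweighted 3x3 mean filter is separable: rank 1, an outer product of 1D kernels
    mean3 = np.ones((3, 3)) / 9
    print(np.linalg.matrix_rank(mean3))                                     # 1
    print(np.allclose(mean3, np.outer(np.ones(3) / 3, np.ones(3) / 3)))     # True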

Exercise 4 (2+2+1=5 credits)

You are given an image of an object on a background containing a strong illumination gradient. You are supposed to segment the object from the background in order to estimate its moment descriptors.

(a) Sketch the histogram of the image and explain what the problems are related to choosing a suitable threshold.

The histogram is not clearly bimodal - it is difficult to find a local minimum to use as the threshold.

(Figure: sketch of the image histogram, without a clear valley between object and background.)

(b) Propose a suitable methodology that could be used to perform successful segmentation.

a) Local adaptive thresholding selects an individual threshold for each pixel based on the range of intensity values in its local neighbourhood. This allows thresholding of an image whose global intensity histogram doesn't contain distinctive peaks (see the sketch after this exercise). b) Estimating the illumination gradient and subtracting it from the image. c) Estimating the gradient close to the object boundaries and using this to perform adaptive thresholding.

(c) You have applied histogram equalization to an image. If you apply histogram equalization to the image a second time, will you improve the image quality even more? Motivate your answer.

No, since the first histogram equalization operation will flatten the histogram as much as possible. The second application will not change the histogram significantly.
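A minimal sketch of the local adaptive thresholding idea from (b), assuming SciPy is available. The window size, the offset and the toy image are arbitrary assumed values, and the local statistic is simply the neighbourhood mean; for this simple version the window has to be larger than the object:

    import numpy as np
    from scipy.ndimage import uniform_filter

    def adaptive_threshold(image, window=61, offset=20.0):
        """Mark pixels that are clearly brighter than the mean of their local neighbourhood."""
        local_mean = uniform_filter(image.astype(float), size=window)
        return image > (local_mean + offset)

    # toy example: a bright square on a background with a strong illumination gradient
    h, w = 128, 128
    background = np.linspace(50, 200, w)[None, :] * np.ones((h, 1))
    image = background.copy()
    image[40:80, 40:80] += 60.0                 # the "object"
    mask = adaptive_threshold(image)            # True on the object, False on the gradient background
    print(mask.sum())                           # ~1600, i.e. roughly the 40x40 object

A single global threshold would fail here, because the right end of the background is brighter than the object on the left end.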

Exercise 5 (2+2+2=6 credits)

(a) What is the epipolar constraint and how can it be used in stereo matching?

It represents the geometry of two cameras and reduces the correspondence problem to a 1D search along an epipolar line. A point in one view generates an epipolar line in the other view; the corresponding point lies on this line. Epipolar geometry is a result of the coplanarity of the two camera centers and a world point - all of them lie in an epipolar plane.

(Figure: epipolar geometry with a world point X, its image points p, p' and q, q', and the two centers of projection o, o'.)

(b) Assume a pair of parallel cameras with their optical axes perpendicular to the baseline. What do the epipolar lines look like? Where are the epipoles for this type of camera system?

The epipolar lines are parallel to each other and to the baseline (for a horizontal baseline they run along the image x axis). If the optical centers of the cameras are at the same height, the corresponding epipolar lines are at the same height - one line for both images. The epipoles are at infinity.

(c) Estimate the essential matrix between two consecutive images for a forward translating camera. What is the equation of the epipolar line for the point p = [x y 1]?

In general $E = t_S R$, where $t_S$ is the skew-symmetric matrix of the translation vector. For a forward translating camera (see figure), we have $R = I$ and

$t_S = \begin{pmatrix} 0 & -t_Z & 0 \\ t_Z & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}$, $\quad$ therefore $\quad E = \begin{pmatrix} 0 & -t_Z & 0 \\ t_Z & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}$

From $l = E\,p$, the epipolar line for a point $p = [x\;y\;1]^T$ is

$l = \begin{pmatrix} 0 & -t_Z & 0 \\ t_Z & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} -t_Z\,y \\ t_Z\,x \\ 0 \end{pmatrix}$
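The algebra in (c) can be checked with a few lines of NumPy; t_Z and the image point are arbitrary example values:

    import numpy as np

    def skew(t):
        """Skew-symmetric matrix [t]_x such that skew(t) @ v equals the cross product t x v."""
        return np.array([[0.0, -t[2], t[1]],
                         [t[2], 0.0, -t[0]],
                         [-t[1], t[0], 0.0]])

    t_Z = 2.0                               # assumed forward translation
    t = np.array([0.0, 0.0, t_Z])
    R = np.eye(3)                           # pure forward translation: no rotation
    E = skew(t) @ R                         # essential matrix

    p = np.array([3.0, 4.0, 1.0])           # image point in homogeneous coordinates
    l = E @ p                               # epipolar line [-t_Z*y, t_Z*x, 0]
    print(E)
    print(l)                                # [-8.  6.  0.]
    # the third coordinate is zero, so the line passes through the image origin,
    # which is the epipole (focus of expansion) for forward translation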

Exercise 6 (2+2+3=7 credits)

(a) What is the difference between the motion field and optical flow?

Motion field: the projection of the 3D motion onto the image plane. Optical flow: the apparent motion of the image brightness pattern (not necessarily related to the motion field).

(b) Under what assumptions does the optical flow constraint work? When does it not work?

Brightness constancy: does not work if the light source moves while the object is static. Textured region: does not work without sufficient local image structure. Alternatively, Lambertian assumption: might not work if objects have specularities (mirror-like objects). Alternatively, spatially local constancy: necessary in practice, since derivatives have to be determined within some region.

(c) Assume that an object located at a distance of 10 m is moving at a speed of 3 m/s in a direction parallel to the image plane, as indicated in the figure. How fast should the camera be rotated around the x and y axes so that the object remains fixated in the centre of the image? Assume that the focal length is f = 5.

(Figure: the object velocity makes a 30 degree angle with the x axis in the image plane.)

The motion of the object in 3D is

$\dot X = 3\cos 30^\circ\ \mathrm{m\,s^{-1}} = \tfrac{3\sqrt{3}}{2}\ \mathrm{m\,s^{-1}} \approx 2.6\ \mathrm{m\,s^{-1}}$, $\quad \dot Y = 3\sin 30^\circ\ \mathrm{m\,s^{-1}} = \tfrac{3}{2}\ \mathrm{m\,s^{-1}} = 1.5\ \mathrm{m\,s^{-1}}$

The translational motion field is given by

$u_t = f\,\dot X/Z = f\cdot 3\sqrt{3}/20\ \mathrm{s^{-1}}$, $\quad v_t = f\,\dot Y/Z = f\cdot 3/20\ \mathrm{s^{-1}}$

and the rotational motion field is

$u_r = f\,\omega_y$, $\quad v_r = f\,\omega_x$

Since the motion field is supposed to be zero, $u = u_t + u_r = 0$ and $v = v_t + v_r = 0$, i.e. $u_r = -u_t$ and $v_r = -v_t$,

the rotation of the camera therefore has to have magnitude

$|\omega_x| = v_t/f = 3/20\ \mathrm{rad\,s^{-1}} = 0.15\ \mathrm{rad\,s^{-1}} \approx 8.6^\circ/\mathrm{s}$, $\quad |\omega_y| = u_t/f = 3\sqrt{3}/20\ \mathrm{rad\,s^{-1}} \approx 0.26\ \mathrm{rad\,s^{-1}} \approx 14.9^\circ/\mathrm{s}$,

with the rotation directions chosen so that the rotational flow cancels the translational flow.

Exercise 7 (3+3=6 credits)

Assume that we want to classify an image into one of two classes: $C_A$ and $C_B$. We know that the prior probability for A is twice as large as that for B: $p(A) = \tfrac{2}{3}$ and $p(B) = \tfrac{1}{3}$. After the preprocessing step we get a feature map z. Calculate an expression for a Bayesian classifier:

(a) in the case of a one-dimensional feature map, with $\sigma_A = 4$, $\sigma_B = 1$, $m_A = m_B = 0$, and

$p(z|C_k) = \frac{1}{\sqrt{2\pi}\,\sigma_k}\,e^{-(z-m_k)^2/(2\sigma_k^2)}.$

Sketch the decision functions and calculate the decision boundaries.

Using Bayes' rule (without the normalization), the decision boundary satisfies

$p(C_A)\,p(z|C_A) = p(C_B)\,p(z|C_B)$

Using the above equation for the PDFs and the priors:

$\frac{2}{3\sqrt{2\pi}\cdot 4}\,e^{-z^2/32} = \frac{1}{3\sqrt{2\pi}}\,e^{-z^2/2}$

which gives $z^2 = \frac{32\ln 2}{15} \approx 1.48$, i.e. the decision boundaries

$z = \pm 1.216$

The decision function is: class A if $|z| > 1.216$, class B if $|z| < 1.216$ (class A, which has the larger variance, wins in the tails).
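The decision boundary in (a) can be verified numerically. A NumPy sketch (variable names are mine):

    import numpy as np

    pA, pB = 2 / 3, 1 / 3
    sA, sB = 4.0, 1.0                      # standard deviations; both means are zero

    def posterior_ratio(z):
        """Unnormalized ratio p(A)p(z|A) / p(B)p(z|B)."""
        gA = pA / (np.sqrt(2 * np.pi) * sA) * np.exp(-z ** 2 / (2 * sA ** 2))
        gB = pB / (np.sqrt(2 * np.pi) * sB) * np.exp(-z ** 2 / (2 * sB ** 2))
        return gA / gB

    # solving ln(1/2) - z^2/32 = -z^2/2 gives z^2 = 32 ln 2 / 15
    z_star = np.sqrt(32 * np.log(2) / 15)
    print(z_star)                          # ~1.216
    print(posterior_ratio(z_star))         # ~1.0 at the boundary
    print(posterior_ratio(0.0) < 1, posterior_ratio(3.0) > 1)   # B wins near 0, A wins in the tails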

(b) in the case of a two-dimensional feature map, with

$\Sigma_A = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$, $\quad \Sigma_B = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}$, $\quad m_A = m_B = 0$, and

$p(z|C_k) = \frac{1}{2\pi\,|\det\Sigma_k|^{1/2}}\,e^{-(z-m_k)^T\Sigma_k^{-1}(z-m_k)/2}.$

Calculate the two-dimensional decision boundary.

From the above we have:

$\det\Sigma_A = 2$, $\quad \Sigma_A^{-1} = \frac{1}{2}\begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}$, $\quad \det\Sigma_B = 2$, $\quad \Sigma_B^{-1} = \frac{1}{2}\begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$

Again, using Bayes' rule without normalization:

$\frac{1}{3\pi\sqrt{2}}\,e^{-z^T\begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}z/4} = \frac{1}{6\pi\sqrt{2}}\,e^{-z^T\begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}z/4}$

With $z = (x, y)^T$:

$2\,e^{-(2x^2+y^2)/4} = e^{-(x^2+2y^2)/4}$

$\ln 2 - (2x^2+y^2)/4 + (x^2+2y^2)/4 = 0$

The two-dimensional decision boundary is thus represented by a hyperbola:

$x^2 - y^2 = 4\ln 2$
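Similarly for (b): points on the hyperbola $x^2 - y^2 = 4\ln 2$ should give equal unnormalized class scores. A quick numerical check with an arbitrary test point on the curve:

    import numpy as np

    pA, pB = 2 / 3, 1 / 3
    SA = np.array([[1.0, 0.0], [0.0, 2.0]])
    SB = np.array([[2.0, 0.0], [0.0, 1.0]])

    def score(z, prior, S):
        """Unnormalized class score: prior times the zero-mean Gaussian density."""
        norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(S)))
        return prior * norm * np.exp(-0.5 * z @ np.linalg.inv(S) @ z)

    x = 2.0                                  # pick x, then solve for y on the boundary
    y = np.sqrt(x ** 2 - 4 * np.log(2))
    z = np.array([x, y])
    print(score(z, pA, SA), score(z, pB, SB))    # equal on the boundary x^2 - y^2 = 4 ln 2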