Fabio Remondino. Tracking of human movements in image space
Table of contents

1. Introduction
2. Human tracking overview
3. Data acquisition
4. Algorithms overview
   4.1 The least square matching tracker
   4.2 Object tracking
   4.3 The Shi-Tomasi-Kanade tracker
   4.4 Detection and tracking of moving objects
5. Feature selection for tracking human body parts
6. Results
   6.1 Least square matching tracking
   6.2 Shi-Tomasi-Kanade tracker
   6.3 Detection and tracking of moving objects
   6.4 Object tracking
7. Conclusions
8. Future works
Bibliography
1. Introduction

Human motion analysis is receiving increasing attention from researchers in different fields of study. The interest is motivated by a wide spectrum of applications, such as athletic performance analysis, surveillance, man-machine interfaces, video-conferencing, human-computer interaction and motion capture (games and animation). A complete model of a human consists of both the movements and the shape of the body. Many of the available systems treat the two modeling processes as separate even though they are closely related. Depending on the application (animation, visualization, medical imaging), different methods can be used for the measurement of the body shape: laser scanner, infra-red light scanner, photogrammetry, structured light. The modeling of the movement is often obtained by capturing the motion with tracking processes: this can be achieved with photogrammetric methods, electromagnetic or mechanical sensor systems and image-based methods. In general the tracking process can be described as the establishment of correspondences of the image structure between consecutive frames, based on features related to position, velocity, shape, color and texture. The main problem is to establish the corresponding features in different images automatically. Tracking is required for 2D and 3D object localization and it is also used for object detection, classification and identification. The main goals of motion studies are to detect moving regions (points, features, areas), estimate the motion, model articulated objects and interpret the motion.
It is a very hard task because:
- the appearance of people can vary dramatically from frame to frame;
- people can appear in arbitrary poses;
- the human body can deform in complex ways;
- tracked points can be occluded, resulting in ambiguities and multiple interpretations;
- tracked points (joints) are often not well observable (clothing hides the underlying structure);
- it is a geometrically under-constrained problem (images are 2D entities of a 3D world).
This work focuses on the tracking of movements of humans in monocular sequences of images. Section 2 gives a general overview of tracking techniques, including motion capture, human modelling processes and moving object detection. In Section 3 the techniques used for image acquisition and the contrast enhancement process are presented. In Section 4 an overview of the implemented algorithms is given, while a short description of interest features is contained in Section 5. Finally, Section 6 shows all the results for the validation of the algorithms.
2. Human tracking overview

The main problem of tracking humans (and in particular human movements) is how to capture the position and motion in space of the articulated parts of the human body. Typically the tracking process involves matching between frames using pixels, points, lines and blobs based on their motion, shape or other visual information. Tracking the movements of persons and modeling the different parts of the human body are two closely related applications. There are two main techniques to capture human motion [2]:

(a) Tracking using body markers
These tracking systems can be divided into [13]:
1. Systems which employ sensors on the body that sense artificial external sources (e.g. an electromagnetic field) or natural external sources. These systems provide 3D world-based information, but their workspace and accuracy are generally limited due to the use of the external sources, and their form factor restricts their use to medium and larger sized body parts.
2. Systems which employ an external sensor that senses artificial sources or markers on the body (e.g. an electro-optical system that tracks reflective markers) or natural sources on the body (e.g. a video-camera based system that tracks the pupil and cornea). These systems generally suffer from occlusion and a limited workspace.
3. Systems which employ sensors and sources that are both on the body (e.g. a glove with piezoresistive flex sensors). The sensors generally have small form factors and are therefore especially suitable for tracking small body parts. These systems allow for the capture of any body movement and an unlimited workspace, but generally do not provide 3D world-based information.
In figure 2.1 some systems for motion capture are presented.

Fig.2.1: Different systems for motion capture. Left and right: retro-reflective markers.
Middle: an electro-mechanical system.

All these techniques are used especially in motion capture, where the objects' position and orientation in physical space are recorded as information in a suitable form that animators can use to control elements in a computer-generated scene. The disadvantages of these techniques are:
- displacement of the markers during movement introduces uncertainty in the results;
- difficulty of placement on complex articulations (like shoulders and knees);
- rigidity in movement (psychological effects);
- difficult calibration of the system.
The main advantage is the capability of some systems to process the data and produce 3D results in real time.

(b) Tracking without markers (marker-free methods)
Marker-free methods are based on image sequence processing/analysis. These methods are often model-based; the image sequences can be acquired either from one camera (monocular vision) or from multiple cameras (multi-view). In the monocular case, different approaches can be used to track the human body: matching point features, contour extraction (sensitive to noise), 3-D geometric primitives projected onto the images [13], probabilistic models of the joint positions [22], particle filtering [3], active part decomposition. In the multi-view approach, multiple cameras simultaneously acquire different views of the person, and the 3-D body poses and motions at each time instant can be recovered from the multi-image sequences [7]. The marker-free methods offer the subject complete freedom of movement, which is not the case when tracking with markers. Image understanding and extrapolation of the third dimension are the main problems of these methods, especially in monocular vision. In this case the 3D coordinates can be inferred from the 2D image coordinates, e.g. using a Bayesian approach and a set of training data [11] or by fitting the projection of a three-dimensional person model through the sequence [21, 24]. The main problems of these approaches are the models of the different parts of the body (using cylinders, cones, elliptical cylinders), the large number of degrees of freedom of the model (body joints, rotations, orientations) and the modeling of the motion (prediction of the next steps). In the multi-image approach, stereo vision can be used to extract 3D information from the sequence.

Fig.2.2: Left: geometric primitives projected onto the image [21]. Right: a volumetric human model [1]

The interest in human motion analysis can also be limited to detecting moving objects in image sequences.
In applications such as real-time tracking, monitoring of wide-area sites or surveillance, tracking approaches based on moving object localization and body shape or body boundary tracking are used (fig.2.3). The moving objects can be identified in the images using background subtraction or optical flow. If a motion of the camera is also present, a rectification of the frames must be performed in order to apply the background knowledge [12]. Moving objects in the scene are often segmented, while occlusion problems can be solved using temporal analysis and trajectory prediction (Kalman filter) [17].

Fig.2.3: Moving shapes tracking
3. Data acquisition

Four sequences (fig.3.1, 3.2, 3.3, 3.4) have been acquired with a Sony DCR-VX700E, a Sony digital handycam that records images in digital format on a mini DV tape. The images are stored in DV format with a size of 720x576 pixels and 24 bit color resolution. The DV format is a proprietary Sony compressed digital audio and video recording standard. As CCD cameras are interlaced, i.e. a full frame is split into two different fields which are recorded and read out consecutively, the odd and even lines of an image are captured at different times and a saw-tooth pattern is created during the digitizing process. For this reason only the odd (or even) lines of an image are used in the algorithm, reducing the resolution in the vertical direction by 50 per cent. Two other sequences (fig.3.5, 3.6) have been acquired by digitizing an old VHS tape. In this case too, the digitization process creates a saw-tooth pattern in the images; therefore reduced images are used for the validation of the algorithm.

Fig.3.1: Sequence of 24 frames of a walking man: the camera is rotating on a tripod
Fig.3.2: Sequence of 60 frames: the camera is still and the guy is just raising his arms

The two sequences acquired from the VHS tape (fig.3.5, 3.6) have very low resolution because of the video tape and the digitization process (RAZOR software). No attempt at enhancing the frames was successful: different filters and also motion blur compensation did not achieve good results. Therefore just a local contrast enhancement has been done.
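The field-splitting step described above can be sketched in a few lines. This is only an illustrative sketch, not the report's implementation; the function name and the odd/even indexing convention are assumptions:

```python
import numpy as np

def deinterlace(frame: np.ndarray, field: str = "odd") -> np.ndarray:
    """Return a single field of an interlaced frame.

    Keeping only the odd (or even) lines avoids the saw-tooth pattern
    caused by the two fields being exposed at different times, at the
    cost of halving the vertical resolution.
    """
    start = 1 if field == "odd" else 0
    return frame[start::2, ...]

# A 576-line PAL frame, as in the 720x576 DV sequences:
frame = np.zeros((576, 720), dtype=np.uint8)
half = deinterlace(frame)   # 288 lines, half the vertical resolution
```

The resulting half-height field can then be fed to the trackers directly, or stretched back to the original height if a square pixel aspect is needed for display.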
Fig.3.3: Sequence of 9 frames acquired from VHS tape
Fig.3.4: Sequence of 10 frames from VHS tape
Fig.3.5: Sequence of 100 frames: two people are walking one towards the other. Their trajectories are perpendicular to the camera, which is still and far away from them
Fig.3.6: Sequence of 50 frames of moving people walking towards the camera
4. Algorithms overview

In this section the implemented algorithms are described: least squares matching tracker, object tracking and extraction, Shi-Tomasi-Kanade tracker, detection and tracking of moving objects.

4.1 The least square matching tracker

The basic idea of this algorithm is to track a selected point through a sequence of images using least squares matching (LSM). The process is based on the adaptive least squares matching technique [9] and is similar to [4]. Assume two image regions are given as discrete two-dimensional functions f(x,y) and g(x,y), with f(x,y) the template in one image and g(x,y) the patch in the other image; a correspondence is established if

f(x,y) = g(x,y)   (4.1)

Because of random effects (noise) in both images, the above equation is not consistent. Therefore a noise vector e(x,y) is added, resulting in

f(x,y) - e(x,y) = g(x,y)   (4.2)

The location of the function values g(x,y) must be determined in order to provide the match point. This is achieved by minimizing a goal function which measures the distances between the grey levels in the template and in the other patch. The goal function to be minimized in this approach is the L2-norm of the residuals of the least squares estimation. Eq.(4.2) can be considered as a non-linear observation equation which models the vector of observations f(x,y) with a function g(x,y), whose location in the other image must be estimated. The location is usually described by shift parameters which are estimated with respect to an initial position of g(x,y). In order to account for a variety of systematic image deformations and to obtain a better match, image shaping parameters (affine image shaping) and radiometric corrections can be introduced besides the shift parameters [9].
An affine transformation is often used and the pixel coordinates of the matched point are computed as

x_new = a0 + a1*x + a2*y   (4.3.1)
y_new = b0 + b1*x + b2*y   (4.3.2)

where the 6 parameters of the affine transformation must be estimated from eq.(4.2) by minimizing the sum of the squares of the differences between the grey values in the image patches. The function g(x,y) in eq.(4.2) is linearized with respect to the unknown parameters and the obtained linear system is iterated using a Gauss-Markov method [9]. The implemented algorithm uses two images, one as template and the other as search image. The patches in the search image are modified by the affine transformation (translation, rotation, shearing and scaling) and the corresponding point is found in the search image after some iterations. Fig.4.1 shows the result of the least squares matching: the red box is the selected patch in the template image and the green box represents the affinely transformed patch in the search image (emphasized).

Fig.4.1: LSM algorithm: template image (left) and search image (right)
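The 6-parameter mapping of eq.(4.3) can be illustrated directly. This is a minimal sketch of the patch-coordinate warp only (the function name and parameter packing are assumptions), not the LSM estimation itself:

```python
import numpy as np

def affine_warp(coords: np.ndarray, a, b) -> np.ndarray:
    """Apply eq.(4.3): x_new = a0 + a1*x + a2*y, y_new = b0 + b1*x + b2*y.

    coords: (N, 2) array of (x, y) pixel coordinates of the patch.
    a, b:   the six affine parameters (a0, a1, a2) and (b0, b1, b2).
    """
    x, y = coords[:, 0], coords[:, 1]
    x_new = a[0] + a[1] * x + a[2] * y
    y_new = b[0] + b[1] * x + b[2] * y
    return np.stack([x_new, y_new], axis=1)

# Identity parameters (a = (0,1,0), b = (0,0,1)) leave the patch unchanged;
# non-zero a0, b0 are the pure shift estimated by the matcher.
pts = np.array([[10.0, 20.0], [11.0, 20.0]])
shifted = affine_warp(pts, a=(2.0, 1.0, 0.0), b=(3.0, 0.0, 1.0))
```

In the real LSM iteration these six parameters are the unknowns of the linearized system; the warp above is only the forward model that resamples the search patch at each iteration.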
In [4] three sequences of images from three synchronized cameras are available: spatial correspondences between the three images at the same instant t and also temporal correspondences between subsequent frames of each camera are computed, and the 3D trajectory can be determined. In our case the algorithm works with monocular sequences of images and only temporal correspondences can be found. The fundamental operations of the tracking process are three:
1. predict the position in the next frame;
2. search the position with the highest cross-correlation value;
3. establish the point in the next frame using least squares matching.
If the images have been taken at near time instants, they are strongly related to each other and the image positions of two corresponding features are very similar. Therefore, for the frame at time t+1, the predicted position of a point is the same as at time t (fig.4.2). Around this position a search box is defined (blue box) and scanned for the position with the highest cross-correlation. This position is considered an approximation of the exact position of the point to be tracked. The LSM algorithm is then applied at that position (red cross) and the result of the matching is considered the exact position of the tracked point in the next frame.

Fig.4.2: The cross-correlation process to find the approximation for LSM. Frame at time t: in red the patch for LSM. Frame at time t+1: in blue the search area for cross-correlation

For the frame at time t+2 a linear prediction of the position of the point from the two previous frames is computed (fig.4.3). Then a search box is defined around this predicted position and the point with the highest cross-correlation is used for the LSM computation. For the next frames a linear prediction (based on the previous positions) is always computed, even if a more sophisticated interpolation could be implemented (splines or a Kalman filter, especially after occlusions).
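The cross-correlation search of step 2 can be sketched as follows. This is an illustrative sketch with assumed function names and an assumed top-left patch convention, not the report's implementation:

```python
import numpy as np

def ncc(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized cross-correlation coefficient of two equal-size patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def best_match(template, search_img, top_left, half_box, step=1):
    """Scan a (2*half_box+1)^2 neighbourhood of the predicted patch
    position (row, col of the patch's top-left corner) and return the
    location with the highest correlation coefficient."""
    h, w = template.shape
    cy, cx = top_left
    best, best_pos = -1.0, top_left
    for dy in range(-half_box, half_box + 1, step):
        for dx in range(-half_box, half_box + 1, step):
            y, x = cy + dy, cx + dx
            patch = search_img[y:y + h, x:x + w]
            if patch.shape != template.shape:
                continue          # partly outside the image
            rho = ncc(template, patch)
            if rho > best:
                best, best_pos = rho, (y, x)
    return best_pos, best
```

The returned position is only the integer-pixel approximation; in the tracker it is handed to LSM, which refines it to sub-pixel accuracy with the affine parameters.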
As the algorithm works with monocular sequences, few automatic controls on the corresponding matched points can be performed. In order to verify the reliability of the tracked points, two post-processing verifications have been implemented:
1. cross-correlation computation: it checks whether the matched point is reliable between two frames. If the cross-correlation coefficient of a point in two consecutive images is smaller than a predefined threshold value, the point is rejected;
2. distance between two matched joints: this test can be performed if the camera does not zoom and is stationary, or if its movements are slower than the moving objects; in these cases a distance can be computed, in each frame, between two points on the body that must remain at the same distance (e.g. foot-knee, wrist-shoulder). The difference of this distance in two consecutive frames is then calculated, and if the difference does not belong to a predefined domain, the tracked point is rejected.

Fig.4.3: Linear prediction (times t-1, t, t+1) to find the approximated position of the point

A cross-correlation computation has also been implemented to recover lost points after occlusions. The user must manually select the last image where the point is visible and the image where the point reappears. The process finds the new position after the occlusion using a suitable window; these coordinates are considered an approximation of the point and the LSM is applied to compute the correct position. If the tracked points have been selected in correspondence of the human joints, a final animation of the tracked points can be done and the 2D trajectories can be drawn.

4.2 Object tracking

A tracking process can also involve the extraction of parts of objects using a few tracked points. Using an image matching process [4] which establishes many correspondences in three consecutive images, it is possible to extract the full body (or part of it) through the sequence. The process is based on the adaptive least squares method [9] and automatically determines a dense set of corresponding points between the images, starting from a few seed points sparsely distributed on the surface to extract. The template image is divided into polygonal regions according to which of the seed points is closest (Voronoi tessellation) (fig.4.4).

Fig.4.4: Search strategy for the establishment of correspondences between images (o seed points, . matched points)

Starting from the seed points and using a user-defined border of the object of interest, the algorithm tries to match corresponding points in three consecutive images. The central image is used as template and the other two as search images. The matcher searches for the corresponding points in the two images independently.
The process starts from a selected point, shifts horizontally in the template and in the search images and applies the LSM algorithm at the shifted location. If the quality of the matching is good, the matched point is stored and the process continues horizontally until it reaches the region boundaries. The covering of the entire polygonal region of a seed point is achieved by sequential horizontal and vertical shifts. In monocular sequences the reliability of the matched surfaces depends only on the matching parameters; in multi-view sequences a control can be done using the computed 3D coordinates to check for wrong correspondences [5].
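The visiting order produced by the sequential horizontal and vertical shifts can be sketched as a breadth-first growth from the seed, bounded by the seed's polygonal region. This is only a sketch of the scheduling (function name, queue-based formulation and unit shift are assumptions); the real matcher additionally runs LSM at every visited location and stops a branch when the match quality is poor:

```python
import numpy as np
from collections import deque

def grow_region(seed, region_mask, shift=1):
    """Visit the pixels of a seed point's region by sequential
    horizontal and vertical shifts starting from the seed, i.e. the
    order in which new correspondences would be attempted."""
    h, w = region_mask.shape
    visited = np.zeros_like(region_mask, dtype=bool)
    queue = deque([seed])
    visited[seed] = True
    order = []
    while queue:
        y, x = queue.popleft()
        order.append((y, x))
        # horizontal shifts first, then vertical ones
        for dy, dx in ((0, shift), (0, -shift), (shift, 0), (-shift, 0)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and region_mask[ny, nx] and not visited[ny, nx]:
                visited[ny, nx] = True
                queue.append((ny, nx))
    return order
```

Because the growth is confined to `region_mask`, unmatched holes (e.g. low-texture areas) simply stay unvisited, which is why the report closes the remaining gaps by searching from all directions around them.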
To evaluate the quality of the matched points the following indicators are used:
- the a posteriori standard deviation of the least squares adjustment;
- the standard deviation of the shift in the x and y directions.
If the quality of the matching is not satisfactory, the algorithm recomputes the process, changing some parameters, like a smaller shift from the neighbor or a bigger patch size. At the end of the process a cloud of 2D points is obtained (fig.4.5, second row), even if some holes due to unanalyzed areas can appear in the results: the algorithm tries to close these gaps by searching from all directions around them. If the holes are in areas with low texture, the matching does not find many correspondences; the results can therefore be improved by increasing the number of seed points in these areas or by using neighborhood information.

Fig.4.5: Triplet of successive frames and found 2D correspondences

4.3 The Shi-Tomasi-Kanade tracker

In this section the Shi-Tomasi-Kanade tracker [14, 19, 23] is briefly described. In general, any function of three variables I(x,y,t), where the space variables x and y as well as the time variable t are discrete and suitably bounded, can represent the intensity of an image sequence. If the camera moves, the patterns of image intensities change in a complex way; but images taken at near time instants are usually strongly related to each other, because in general they refer to the same scene taken from only slightly different viewpoints. Consider an image sequence I(x,t), with x = [u,v]^T the coordinates of an image point. If the time sampling frequency is sufficiently high, we can assume that small image regions are displaced but their intensities remain unchanged. Therefore I(x,t) is not arbitrary but satisfies

I(x,t) = I(δ(x), t+Δt)   (4.4)

where δ(x) is the motion field, specifying the warping that is applied to image points between the time instants t and t+Δt.
The fast-sampling hypothesis allows us to approximate the motion with a translation, that is, δ(x) = x + d, where d is a displacement vector. So a later image, taken at time t+Δt, can be obtained by moving every point in the current image, taken at time t, by a suitable amount d. As the image motion model is not perfect, and because of image noise, equation (4.4) is not exactly satisfied and can be written as

I(x,t) = I(δ(x), t+Δt) + n(x)   (4.5)

where n is a noise function.
The tracker's task is to compute the displacement d for a number of selected points for each pair of successive frames in the sequence. The displacement is computed by minimizing the SSD (Sum of Squared Differences) residual

ε = Σ_W [ I(x+d, t+Δt) - I(x,t) ]²   (4.6)

where W is a small image window centered on the point for which d is computed. By plugging the first-order Taylor expansion of I(x+d, t+Δt) into eq.(4.6) and imposing that the derivatives with respect to d are zero, we obtain the linear system

G d = e   (4.7)

where

G = Σ_W [ Iu²    Iu·Iv
          Iu·Iv  Iv²  ]   (4.8.1)

with

Iu = ∂I/∂u,  Iv = ∂I/∂v   (4.8.2)

and e, the error vector, is

e = Σ_W It [ Iu  Iv ]^T   (4.8.3)

with It = ∂I/∂t. The derivatives of the function I can be computed with finite pixel differences, but there are always problems with image noise and local minima. A better solution can be achieved by convolving the function with a special filter (Gaussian kernel). The tracker is based on eq.(4.7): given a pair of successive frames, d is the solution of (4.7), that is d = G⁻¹e, and is used to compute the position in the new frame. The procedure is iterated according to a Newton-Raphson scheme until the displacement converges. The translation model δ(x) = x + d cannot account for certain transformations of the feature window we are tracking, for instance rotation, scaling and shear. An affine motion model is more accurate [19]:

[ x+u ]   [ a1  a2 ] [ x ]   [ a5 ]
[ y+v ] = [ a3  a4 ] [ y ] + [ a6 ]   (4.9)

because two rotations, two translations, a scale in x/y and a shear are considered.
The affine model computes the δ(x) of eq.(4.4) as

δ(x) = Ax + d   (4.10)

where d is a displacement and A is a 2x2 matrix accounting for the affine warping, which can be written as A = 1 + D, with D = [d_ij] a deformation matrix and 1 the identity matrix. As in the translational case, the motion parameters D and d are estimated by minimizing the SSD residual

ε = Σ_W [ I(Ax+d, t+Δt) - I(x,t) ]²   (4.11)

Equation (4.11) is differentiated with respect to the unknown entries of the matrix D and the vector d and the results are set to zero. Linearizing the resulting system by Taylor expansion, we obtain the linear system

T z = a   (4.12)

where

z = [ d11  d12  d21  d22  d1  d2 ]^T   (4.13.1)

contains the unknown entries of the deformation matrix D and of the displacement vector d;

a = Σ_W It [ u·Iu  u·Iv  v·Iu  v·Iv  Iu  Iv ]^T   (4.13.2)

is the error vector that depends on the differences between the two images; and

T = Σ_W [ U    V^T
          V    G  ]   (4.13.3)

where U is a 4x4 matrix containing the products of the first 4 elements of the vector a with each of these elements, V is a 2x4 matrix containing the products of the elements Iu and Iv with the first 4 elements of a, and G is as in equation (4.8.1). Finally, equation (4.12) can be solved iteratively for the entries of z. In both cases (translational and affine model) the feature selection is very important. In [19] it is recommended that T (or G) be well conditioned, i.e. the ratio between the largest and the smallest eigenvalue of T (or G) should not be too big (corner selection). Once the displacement has been found and the new position of the point has been determined, a control on the new position must be done. The control is computed with a cross-correlation process: given a template window around the point in frame n and a slave window around the matched point in frame n+1, a cross-correlation coefficient ρ is computed. The corresponding feature in frame n+1 is accepted if the computed ρ is bigger than a user-defined threshold value ρ0.
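One Newton step of the translational tracker (build G and e from the derivatives inside the window W, then solve G d = e) can be sketched as follows. This is a minimal sketch, not the report's code: the function name is assumed, derivatives are plain finite differences via `np.gradient` instead of the Gaussian-filtered derivatives recommended above, and no iteration or ρ-based acceptance test is included:

```python
import numpy as np

def klt_translation(I0: np.ndarray, I1: np.ndarray, center, half_win):
    """One step of eq.(4.7) for the window W centered at (row, col):
    returns the displacement d = [d_u, d_v] solving G d = e."""
    y, x = center
    sl = (slice(y - half_win, y + half_win + 1),
          slice(x - half_win, x + half_win + 1))
    Iu = np.gradient(I0, axis=1)[sl]   # spatial derivative along u (columns)
    Iv = np.gradient(I0, axis=0)[sl]   # spatial derivative along v (rows)
    It = (I1 - I0)[sl]                 # temporal difference
    G = np.array([[np.sum(Iu * Iu), np.sum(Iu * Iv)],
                  [np.sum(Iu * Iv), np.sum(Iv * Iv)]])
    e = -np.array([np.sum(It * Iu), np.sum(It * Iv)])
    return np.linalg.solve(G, e)       # d = G^-1 e
```

In the full tracker this step is repeated (Newton-Raphson) with the window re-sampled at the updated position, and the affine version solves the analogous 6x6 system T z = a instead.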
Usually the STK tracker is not used for tracking human movements in image sequences; but if the images have been taken at near time instants they are usually strongly related to each other, and this (extended) tracker can give quite good results for not very long sequences of highly textured images.

4.4 Detection and tracking of moving objects

In applications like video-surveillance and the monitoring of human activities, the main idea is to detect and track moving objects (people, vehicles, etc.) as they move through the scene. Considering one image, the regions of moving objects should be separated from the static environment. To identify and separate the moving objects, different approaches have been proposed: background subtraction [17], 2D active shape models [18], a combination of motion, skin color and face detection [8]. If the camera is stationary, or its movements are very small compared to those of the objects, a simple subtraction of two consecutive frames can be used (fig.4.6-c). The resulting image has much larger values for the moving components of the frame than for the stationary components. A moving object produces two regions having large values:
1. a front region of the object, caused by the covering of the background by the object;
2. a rear region of the object, caused by the uncovering of the background by the object.
Therefore, by thresholding the image it is possible to detect the rear region of the moving object. The threshold value is determined by experiment. The binary thresholded image can contain some noise, which can easily be removed with an erosion process or with a median filter (fig.4.6-d).

Fig.4.6: Example of image subtraction. Two frames of a sequence (a, b). Binary image after absolute image difference, with noise (c): black pixels represent movements. Result after median filter (d)

Once the moving objects have been localized, their bounding boxes can be computed.
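The difference-threshold-filter chain above can be sketched in a few lines. The function names and the majority-vote formulation of the binary median are assumptions of this sketch, not the report's implementation:

```python
import numpy as np

def moving_pixels(frame_a: np.ndarray, frame_b: np.ndarray, thresh: int) -> np.ndarray:
    """Binary motion mask: absolute difference of two consecutive
    frames followed by a threshold (chosen by experiment)."""
    diff = np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16))
    return diff > thresh

def median3(mask: np.ndarray) -> np.ndarray:
    """3x3 median filter on a binary mask: a pixel survives only if at
    least 5 of the 9 pixels in its neighbourhood are set, which removes
    isolated noise pixels while keeping solid moving regions."""
    m = mask.astype(np.uint8)
    p = np.pad(m, 1)
    stack = [p[i:i + m.shape[0], j:j + m.shape[1]]
             for i in range(3) for j in range(3)]
    return np.sum(stack, axis=0) >= 5
```

The `int16` cast avoids the wrap-around that subtracting 8-bit frames directly would cause; the filtered mask is then ready for the projection step that follows.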
For this purpose a vertical projection of the binary image is first computed (fig.4.7). The different objects in the image are often already visible in this projection. The position of the objects along the horizontal axis is determined by slicing the vertical projection. If the counted number of pixels in a slice is higher than a threshold, the slice is identified as an area of moving activity. This is done for all the slices along the horizontal axis, and finally the adjacent slices with moving activity are joined together, obtaining a set of areas where moving activity has been detected (fig.4.7). The size of the slices can be adapted to the specific conditions of the acquired images. The smaller the slices are, the better the precision of the detected areas will be; but if the slices are too small, then different moving objects could be detected as a single moving object. The threshold for the identification of a slice as a moving area depends on the size of the slices and has to be determined by experiment.

Fig.4.7: Vertical projection (left) with two peaks representing the two men. Vertical lines (right) delimiting the moving objects

The same process is then performed with the horizontal projections of the different areas determined along the horizontal axis. The horizontal projection of a person is sometimes divided into two different moving areas: indeed, the middle of the body is usually not moving during the walk and is therefore not detected. Once the moving areas are detected, the bounding boxes can be obtained.

Fig.4.8: Horizontal projections of the x-axis areas (left) and computed bounding boxes (right)

In the case of occlusions (two people walking one towards the other), it can be difficult to divide the vertical projection into its components. To avoid this problem, the center of gravity is computed and the boxes are calculated with respect to this center. Occlusions can also be predicted, detected and handled properly by estimating the positions and velocities of the objects and projecting these estimates onto the image plane [14]. Once the boxes have been computed, it is possible to visualize the moving foreground regions using background subtraction.
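The slicing and joining of the vertical projection can be sketched as follows; the same function applied to the transposed mask gives the horizontal projections of each detected area. This is an illustrative sketch with assumed names and parameters, not the report's code:

```python
import numpy as np

def moving_slices(mask: np.ndarray, slice_width: int, min_pixels: int):
    """Slice the vertical projection of the binary motion mask along the
    horizontal axis, mark the slices containing enough moving pixels and
    join adjacent active slices, returning (x_start, x_end) intervals."""
    proj = mask.sum(axis=0)                      # vertical projection
    n = mask.shape[1] // slice_width
    active = [proj[i * slice_width:(i + 1) * slice_width].sum() >= min_pixels
              for i in range(n)]
    boxes, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                            # a new moving area begins
        elif not a and start is not None:
            boxes.append((start * slice_width, i * slice_width))
            start = None
    if start is not None:                        # area touching the border
        boxes.append((start * slice_width, n * slice_width))
    return boxes
```

As the report notes, `slice_width` trades precision against over-merging, and `min_pixels` must be tuned together with it by experiment.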
5. Feature selection for tracking human body parts

Regardless of the method used for tracking, not all parts of an image contain motion information. Moreover, along an edge we can only determine the motion component orthogonal to the edge, so we must take care in selecting the features to follow through the sequence. In general, to avoid these difficulties, only regions with enough texture are used. In fact a single pixel cannot be tracked unless it has a very distinctive brightness with respect to all of its neighbors. As a consequence, it is often hard or impossible to determine where a single pixel has moved in the subsequent frame based only on local information. Because of these problems, we do not track single points but windows containing good features and sufficient texture. The point features are usually extracted by local operators, often called interest operators. The attributes are computed within a rectangular or circular window, in selected or in all directions, and are usually compared to a threshold to decide whether a feature is good or not. Many feature point extractors have been proposed in recent years [6, 10, 20]. Concerning all these interest operators, some common characteristics can be found:
1. they work with a predefined or arbitrary idea of what a good window looks like;
2. they assume that a good feature is independent of the tracking algorithm;
3. they often find features well trackable only in pure translation;
4. they often find features which are good only in the first frames.
So the resulting features are not guaranteed to be the best for the tracking algorithm over the whole sequence. A feature point should therefore be consistently detectable, and should have enough information in its neighborhood, over the different frames. Concerning tracking operations, researchers have proposed to track features such as corners, windows with high spatial frequency content, or regions where some mix of second-order derivatives is sufficiently high [19].
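The well-conditioning criterion of [19] (both eigenvalues of the gradient matrix G large) gives a simple texturedness score for a candidate window. This is a minimal sketch under assumed names and plain finite-difference gradients, not the report's implementation:

```python
import numpy as np

def texturedness(img: np.ndarray, center, half_win) -> float:
    """Smallest eigenvalue of the 2x2 gradient matrix G of eq.(4.8.1)
    over the window.  Both eigenvalues large: corner-like texture,
    trackable.  One small eigenvalue: an edge (motion determined only
    orthogonally to it) or a flat region."""
    y, x = center
    sl = (slice(y - half_win, y + half_win + 1),
          slice(x - half_win, x + half_win + 1))
    Iu = np.gradient(img, axis=1)[sl]
    Iv = np.gradient(img, axis=0)[sl]
    G = np.array([[np.sum(Iu * Iu), np.sum(Iu * Iv)],
                  [np.sum(Iu * Iv), np.sum(Iv * Iv)]])
    return float(np.linalg.eigvalsh(G)[0])   # ascending order: [0] is λ_min
```

Windows whose score exceeds a threshold are accepted as good features; a flat region scores near zero and a pure edge also scores near zero, matching the aperture-problem discussion above.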
But for tracking human body movements, as we want to extract 2D or 3D information from the tracked points, we cannot take the features randomly all over the body, as an interest operator would do, or just in correspondence of edges: we must select precise points (joints). We are interested in capturing the movement of the human body, therefore we should select points which can define the motion. Usually points in correspondence of the head, shoulders, elbows, wrists, hips, knees and ankles are selected. Once this set of points has been extracted from the image, a human skeleton can be drawn (fig.5.1).

Fig.5.1: Skeleton of the human body (EPFL)
6. Results

After selecting some points of interest, we can apply the different algorithms to track the points. The first two parts of this chapter present the results obtained with the least squares matching tracker and the Shi-Tomasi-Kanade tracker. The results of the detection of moving objects, their tracking and the computation of the bounding boxes are presented in the third part, while the tracking of a whole object and its visualization is shown in the last part. All the results are in image space: 3D coordinates will be recovered in future works.

6.1 Least square matching tracking

The least squares matching tracking process starts from some points selected on the image. These results consider points selected manually and in particular positions (fig.6.1.1), as we want to extract a skeleton of the human body. Using this set of coordinates, the algorithm computes the corresponding points in the other frames. The parameter file used in the computation contains:
- used/not used flags for the parameters of the affine transformation;
- max sigma 0 of the matching;
- max sigma-x and sigma-y in the computation of the affine parameters a0 and b0;
- max value for the affine parameters a0 and b0 (translation parameters);
- size of the window in the template and search image for LSM;
- size of the window in the search image for cross-correlation between the first and second frame;
- size of the window in the search image for cross-correlation in the next frames;
- step for the cross-correlation computation in the search image;
- size of a bigger window in the search image for cross-correlation when the value of the LSM is not satisfactory.
A result is stored when the computed values of the three sigmas and of the two translation parameters are smaller than the default ones in the parameter file. The default value for sigma 0 is 25.0 and for sigma-x and sigma-y it is 0.20; usually all 6 parameters of the affine transformation are used and the max value for a0 and b0 is set to 4.0.
A post-processing computation checks the reliability of the matched points by computing the cross-correlation coefficient between consecutive frames. The default threshold is 0.75, but it can be decreased for low resolution images. In the next pages some results of the LSM tracking process are shown. Fig.6.1.1: Points selected on the image
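The reliability check can be sketched as a normalized cross-correlation between the grey values of the patches around a point in two consecutive frames; a value below the threshold flags a doubtful match. A minimal pure-Python sketch (the flat-list patch representation is a simplifying assumption):

```python
import math

def ncc(template, patch):
    """Normalized cross-correlation between two equal-size grey-value
    patches, given as flat lists of pixel intensities. Returns a value
    in [-1, 1]; 1 means a perfect (up to brightness/contrast) match."""
    n = len(template)
    mt = sum(template) / n
    mp = sum(patch) / n
    num = sum((t - mt) * (p - mp) for t, p in zip(template, patch))
    den = math.sqrt(sum((t - mt) ** 2 for t in template) *
                    sum((p - mp) ** 2 for p in patch))
    return num / den if den > 0 else 0.0  # zero variance: no texture

# A tracked point is accepted when the coefficient between consecutive
# frames exceeds the threshold (0.75 by default, lower for low-resolution
# images, as described in the text).
ACCEPT = 0.75
print(ncc([10, 20, 30, 40], [12, 22, 31, 41]) > ACCEPT)
```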
The first sequence was acquired from a VHS tape and has very low resolution; the camera pans following the walking man and 10 frames were available. 14 points were selected in the first frame; at the end of the process, 10 points had been tracked over the whole sequence (fig.6.1.2). The average cross-correlation coefficient of all the points is . The LSM algorithm worked with a sigma 0 of 30, while sigma-x and sigma-y were fixed to . Frame nr.1 Frame nr.5 Frame nr.9 Fig.6.1.2: Some frames of the sequence with the tracked points
In the next sequence, consisting of 60 frames, 14 points were selected in correspondence of body joints: head, neck, shoulders, elbows, wrists, hips, knees and ankles. After 10 frames the points corresponding to the elbows were lost, while all the other joints were tracked over the whole sequence (fig.6.1.3). The sigma-y was fixed to 0.30 because the subject was moving his arms in the vertical direction and the images have half resolution vertically, as only the odd lines are used. (a) frame nr.1 (b) frame nr.11 (c) frame nr.30 (d) frame nr.50 (e) frame nr.60 Fig.6.1.3: Points tracked in a sequence of 60 frames
Because of the clothes, when the subject moved his arms the folds of the sweater changed, so points selected where the folds moved strongly were not matched (or not matched well). The cross-correlation coefficient between tracked points in two consecutive frames was calculated and the results are summarized in Table 1. All 12 points tracked over the sequence had a cross-correlation coefficient bigger than 0.9. If the camera is still and stays approximately at the same distance from the subject, another check on the tracked points can be done by computing the differences of the distances between pairs of points whose distance is fixed, namely foot-knee, neck-shoulder or neck-head. Figure 6.1.4 shows the computed differences of the distances in all the frames. There is just one big outlier (with a difference of 4 pixels), while all the other differences lie in the interval [-2.4, +2.2] pixels, that is an average error of one pixel for every matched point. The big outlier may be due to the folds of the sweater on the wrist, as said before.
Table 1: Average cross-correlation coefficient of the tracked points: Pt1 wrist left, Pt2 shoulder left, Pt3 neck, Pt4 head, Pt5 shoulder right, Pt6 wrist right, Pt7 hip left, Pt8 hip right, Pt9 knee left, Pt10 knee right, Pt11 ankle left, Pt12 ankle right
Fig.6.1.4: Differences of the distances between some joints (head-neck, foot-knee left/right, wrist-shoulder left/right) over the sequence of 60 frames
Once the 2D coordinates of the joints are computed, it is possible to build (for now only in 2D) a skeleton of the human body and represent the stylized person through the whole sequence. An animation has been created and a visualization is shown in fig.6.1.5 and fig.6.1.6, with cylindric reconstruction of the human body parts.
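The fixed-distance consistency check just described can be sketched as follows; the frame representation as a list of joint dictionaries is an illustrative assumption:

```python
import math

def dist(p, q):
    """Euclidean distance between two 2D points in pixels."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def distance_differences(frames, pair):
    """For a joint pair whose true distance is fixed (e.g. neck-head),
    return the per-frame differences from the first-frame distance.
    Large differences flag mismatched points.

    frames: list of dicts mapping joint name -> (x, y)."""
    a, b = pair
    ref = dist(frames[0][a], frames[0][b])
    return [dist(f[a], f[b]) - ref for f in frames]

frames = [
    {"neck": (100, 40), "head": (100, 20)},
    {"neck": (102, 41), "head": (102, 21.5)},
]
print(distance_differences(frames, ("neck", "head")))
```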
Fig.6.1.5: Visualization of the computed 2D skeleton of the human body
Fig.6.1.6: Cylindric reconstruction of the human skeleton from the 2D points computed with the LSM tracking
Another sequence is presented in fig.6.1.7. (a) frame nr.2 (b) frame nr.6 (c) frame nr.12 (d) frame nr.15 (e) frame nr.22 Fig.6.1.7: Tracked frames with occlusion of some points
This sequence is composed of 24 frames, with 13 points selected in correspondence of joints. A point on the left wrist was lost after 9 frames because of occlusion; the points on the left leg were also lost due to occlusion. When occlusions occur a point can be wrongly matched; from the analysis of the cross-correlation results it is possible to remove the outlier and to track the point again after the occlusion. The points on the leg were recovered after the occlusion using a cross-correlation process (fig.6.1.9). A template is cut around the point in the last image where it is visible; the search area is taken from the image where the point reappears (the user must select both images). The point is found at the centre of the window with the biggest cross-correlation coefficient. The LSM algorithm can then track the recovered points in the remaining frames (fig.6.1.8). Fig.6.1.8: Some frames of the sequence with recovered points after occlusions
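The recovery procedure can be sketched as an exhaustive template search: slide the template cut around the point's last visible position over the search image and keep the window centre with the highest correlation. A simplified pure-Python illustration (square windows and full-image search are assumptions for brevity):

```python
import math

def ncc(a, b):
    """Normalized cross-correlation of two equal-size flat patches."""
    n = len(a); ma = sum(a) / n; mb = sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) *
                    sum((y - mb) ** 2 for y in b))
    return num / den if den > 0 else -1.0  # flat window: reject

def patch(img, r, c, h, w):
    """Flat list of grey values of the h-by-w window at (r, c)."""
    return [img[r + i][c + j] for i in range(h) for j in range(w)]

def recover_point(template_img, tr, tc, search_img, size):
    """Find where a lost point reappears: correlate the template (cut at
    the point's last visible position) against every window of the search
    image and return the centre of the best-scoring window."""
    tpl = patch(template_img, tr, tc, size, size)
    best, best_rc = -2.0, None
    rows, cols = len(search_img), len(search_img[0])
    for r in range(rows - size + 1):
        for c in range(cols - size + 1):
            score = ncc(tpl, patch(search_img, r, c, size, size))
            if score > best:
                best, best_rc = score, (r + size // 2, c + size // 2)
    return best_rc, best
```

In practice the search would be restricted to a window around the predicted position rather than the whole frame, and the recovered point is then handed back to the LSM tracker.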
Fig.6.1.9: Cross-correlation procedure to recover a point lost because of occlusion
The mean cross-correlation coefficient of the points over the whole sequence is 0.88 and the differences of the distances between joints lie in the interval [-2.5, +2.5] pixels. The graph of the differences of the distances is shown in fig.6.1.10. Fig.6.1.10: Differences of the distances between some joints
A final visualization of the sequence with the reconstructed human skeleton is shown in fig.6.1.11. Fig.6.1.11: Visualization (every 3 frames) of the skeleton built with the tracked points
The last sequence was acquired from a VHS tape; the camera was moving, following the running man, and 9 frames were used. 12 points were selected in the first frame and 7 points were tracked over the whole sequence. The LSM sigma 0 was equal to 30, while the cross-correlation coefficient had an average of . In fig.6.1.12 some frames of the sequence are presented with the stylized skeleton overlaid. Frame nr.1 Frame nr.5 Frame nr.9 Fig.6.1.12: A low resolution sequence of 9 frames: the camera is moving following the running man. 7 points have been tracked in all the frames.
6.2 Shi-Tomasi-Kanade tracker

The core of the STK algorithm was already available on the web; a GUI to select and visualize the tracked points and a routine to run the process over a whole sequence have been added. Given two consecutive frames I(x,t) and J(x,t+1), the principal steps of the program are:
- compute the matrix T (or G) and the vector a (or e) of eq. (4.12): the image gradients in both windows are computed with a Gaussian kernel (for fast convergence);
- compute the translation d (in the first few iterations) and the affine parameters (in the last iterations) such that the SSD difference of I(Ax+d) - J(x) is minimized (equation 4.11);
- re-warp J with sub-pixel 2D bilinear interpolation using the computed affine motion;
- check the SSD error.
For every point the algorithm computes n iterations and selects the affine motion parameters with the smallest SSD error. The algorithm is very time consuming. In the first sequence of 24 frames all the points were tracked (recovering those lost to occlusions with the cross-correlation process previously described). The results are shown in fig.6.2.1. Frame nr.1 Frame nr.9 Frame nr.15 Frame nr.21 Fig.6.2.1: Four frames of the sequence. In red the points tracked between consecutive frames, in yellow the reconstructed human skeleton.
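The translation step that opens each point's iteration can be sketched as follows. This is a minimal, translation-only illustration of solving the G d = e system with central-difference gradients; the Gaussian weighting, the affine refinement of the later iterations and the bilinear re-warping are omitted:

```python
def lk_translation_step(I, J, r0, c0, half):
    """One translation-only step between frames I and J for the feature
    at (r0, c0), using a (2*half+1)-square window. Builds the 2x2
    gradient matrix G and the error vector e, then solves G d = e.
    Returns the displacement (dx, dy) or None for untrackable windows."""
    gxx = gxy = gyy = ex = ey = 0.0
    for r in range(r0 - half, r0 + half + 1):
        for c in range(c0 - half, c0 + half + 1):
            gx = (I[r][c + 1] - I[r][c - 1]) / 2.0  # central differences
            gy = (I[r + 1][c] - I[r - 1][c]) / 2.0
            diff = I[r][c] - J[r][c]                # frame difference
            gxx += gx * gx; gxy += gx * gy; gyy += gy * gy
            ex += diff * gx; ey += diff * gy
    det = gxx * gyy - gxy * gxy
    if abs(det) < 1e-12:
        return None  # window lacks texture: not a good feature to track
    dx = (gyy * ex - gxy * ey) / det
    dy = (gxx * ey - gxy * ex) / det
    return dx, dy
```

Since the step comes from a first-order linearization, the estimate is only approximate and the algorithm iterates, re-warping J and repeating until the SSD error stops decreasing.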
The cross-correlation coefficient ρ between consecutive frames has an average of . The mean SSD (Sum of Square Differences) error over the whole sequence was , while the differences of the distances between selected joints lie in the interval [-3, +2.3] pixels (except one big outlier of 4 pixels) (see fig.6.2.2). Fig.6.2.2: Graph with the computed differences of the distances between selected joints. Only one big outlier is present while the other values belong to the interval [-3, +2.3] pixels
In the next sequence, 30 frames were used to validate the algorithm. In the first frame (fig.6.2.3a) 14 points were selected in correspondence of human joints; in the last frames (fig.6.2.3d,e) 10 points were still tracked, while the others were lost due to a small cross-correlation coefficient and a big SSD. (a) frame nr.1 (b) frame nr.5 (c) frame nr.9 (d) frame nr.20 (e) frame nr.30 Fig.6.2.3: Some frames of the sequence with the points tracked with the STK algorithm. In yellow the reconstructed human skeleton
From the results shown in fig.6.2.3 we can see that the point on the lower left border of the sweater seems incorrect in the last frames; however, a movement of the sweater following the lifting of the arms is clearly visible. Nevertheless the cross-correlation coefficient of that point through the sequence is . With the sequences acquired from VHS tape the STK algorithm didn't give very good results: the selected points were tracked with reliable precision for just 2-3 frames and were then lost or mismatched. The STK algorithm needs very good features (in particular in case of movements) and very good texture around the point to be tracked.
6.3 Detection and tracking of moving objects

The detection and tracking of moving objects has been tested on two sequences in which two people were walking. The program works with a sequence of n frames and outputs the images with the different moving objects in colored boxes. The first sequence (100 frames) shows motions that are roughly on a linear path: the trajectories are linear and parallel to the camera plane, and occlusions occur as one man passes directly in front of the other. In the results (fig.6.3.1) there are two color-coded boxes, one for each tracked object. Frame nr.9 Frame nr.45 Frame nr.47 Frame nr.71 Fig.6.3.1: Results of moving people detection: tracking before, during and after occlusions
In fig.6.3.1, the first column shows the projections of the pixels along the vertical and horizontal axes. It is easy to divide the vertical projection into its components when there are no occlusions, but when they occur it can be difficult to distinguish the two parts (peaks) of the projection. To overcome this problem, the center of gravity of the projections is computed and used to assign the bounding box to the correct object. The middle column of fig.6.3.1 shows the computed bounding boxes projected on the image differences, while the last column presents the projections of the boxes on the original image. Occlusions are visible in the second and third row: the bounding boxes are not very precise because the vertical projections overlap and the limits of the boxes are based only on these projections. More sophisticated computations, such as temporal analysis or trajectory prediction, could be implemented.
The moving foreground regions can be visualized with background subtraction. This part of the process is not automatic, but it could be if a model of the empty scene were available [17]: once the bounding boxes have been computed, an image where the area inside the boxes contains only background is selected. Then a subtraction between the two windows is performed and the moving foreground is reconstructed with a few passes of erosion and dilation (fig.6.3.2). Frame nr.9 Frame nr.45 Frame nr.71 Fig.6.3.2: Foreground moving regions detected by background subtraction
In the second sequence (fig.6.3.3), 50 frames were available; two people were walking towards the stationary camera and their trajectories were not perpendicular to the camera plane.
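The box computation from the axis projections described above can be sketched as follows; this single-object illustration leaves out the slicing of the projections into per-object components, but includes the centre-of-gravity helper used to assign a box when projections overlap:

```python
def bounding_box(diff_mask):
    """Bounding box of the changed pixels in a binary difference image,
    derived from the projections of the pixels onto the image axes."""
    col_proj = [sum(col) for col in zip(*diff_mask)]  # vertical projection
    row_proj = [sum(row) for row in diff_mask]        # horizontal projection
    cols = [i for i, v in enumerate(col_proj) if v > 0]
    rows = [i for i, v in enumerate(row_proj) if v > 0]
    if not rows or not cols:
        return None  # no motion detected in this frame
    return min(rows), min(cols), max(rows), max(cols)  # top, left, bottom, right

def centre_of_gravity(proj):
    """Centre of gravity of a projection; used to assign a bounding box
    to the correct object when two projections overlap during occlusion."""
    total = sum(proj)
    return sum(i * v for i, v in enumerate(proj)) / total if total else None
```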
Frame nr.5 Frame nr.25 Frame nr.45 Fig.6.3.3: Bounding boxes of two moving people walking towards the camera
The computed bounding boxes depend on the sliced projections, and the size of the slices can be adapted to the specific conditions of the frames. The projections depend on the image differences; therefore in some frames small movements of the humans (e.g. the feet) are not included in the boxes. Frame nr.45 Fig.6.3.4: Foreground regions detected by background subtraction
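The foreground reconstruction by background subtraction followed by a few passes of erosion and dilation can be sketched like this; the 4-neighbourhood structuring element and the threshold value are illustrative assumptions:

```python
def threshold_diff(frame, background, t):
    """Binary foreground mask: pixels whose grey value differs from the
    background model by more than the threshold t."""
    return [[1 if abs(f - b) > t else 0 for f, b in zip(fr, br)]
            for fr, br in zip(frame, background)]

def erode(mask):
    """4-neighbourhood erosion: removes isolated noise pixels."""
    h, w = len(mask), len(mask[0])
    return [[1 if (mask[r][c] and 0 < r < h - 1 and 0 < c < w - 1
                   and mask[r - 1][c] and mask[r + 1][c]
                   and mask[r][c - 1] and mask[r][c + 1])
             else 0 for c in range(w)] for r in range(h)]

def dilate(mask):
    """4-neighbourhood dilation: grows the surviving regions back."""
    h, w = len(mask), len(mask[0])
    def on(r, c):
        return 0 <= r < h and 0 <= c < w and mask[r][c]
    return [[1 if (on(r, c) or on(r - 1, c) or on(r + 1, c)
                   or on(r, c - 1) or on(r, c + 1))
             else 0 for c in range(w)] for r in range(h)]
```

Applying erosion then dilation (a morphological opening) discards the small speckles produced by image noise while keeping the compact foreground region of the moving person.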
6.4 Object tracking

To complete the tracking procedure, once a few points of an object have been tracked over a sequence, it is possible to extract and visualize the whole moving object by establishing many correspondences in some images, starting from the few tracked points. A cloud of 2D points is obtained and visualized by displaying the matched grey values of the image. Using the sequence of fig.3.1, groups of three frames were created and the middle frame used as template image. The seed points were tracked with the LSM tracker (fig.6.4.1, upper row) over the whole sequence and then used to establish the correspondences. A cloud of points was computed in every frame and projected onto the image (fig.6.4.1, second row). In fig.6.4.2 all the matched grey values of the triplet are displayed. Fig.6.4.1: Computed correspondences in a triplet of images Fig.6.4.2: Object extraction from the computed 2D correspondences in a triplet of images.
The central image of fig.6.4.2 is the template image: the number of matched correspondences is bigger than in the search images, where many more holes are present due to unanalyzed areas. The gaps can be due to poor texture, low contrast, or wrong matching. In figure 6.4.3 some b/w and color results of the sequence are shown. Frame nr.2 Frame nr.14 Frame nr.23 Fig.6.4.3: Central frames of the triplets: visualized matched points representing the computed 2D correspondences extracted from b/w images (upper rows). Tracked object in color images (bottom row).
A problem encountered in object tracking is texture: even if the image has high resolution, the matching process does not work with low texture, leaving big holes in regions where the texture of the subject is uniform (the central part of the trousers or of the sweater). Some indicators that evaluate the quality of the results are shown in Table 2 as averages over the whole sequence.
Table 2: Some indicators to evaluate the quality of the process: mean sigma 0 and std. dev., mean sigma-x and std. dev., mean sigma-y and std. dev.
In the second sequence, consisting of 10 frames, the process worked quite well but many gaps occurred in the results (fig.6.4.4). The seed points used for the measurement were computed with the LSM tracker; there were 18 points in the first triplet, but in the successive frames the number decreased because of incorrect matching or occlusions; therefore in the next triplets some seed points were added. The holes in the tracked object are bigger than in the other sequences because of the low resolution of the images (fig.6.4.5). Fig.6.4.4: A triplet of the sequence: in the middle column the template image with the matched grey values, at the borders the correspondences found in the search images. In Table 3 the indicators of the process are presented.
Table 3: Some indicators to evaluate the quality of the matching, averaged over all 9 frames: mean sigma 0 and std. dev., mean sigma-x and std. dev., mean sigma-y and std. dev.
Fig.6.4.5: Central templates of the next three triplets of the sequence: big lacks of texture on the tracked object are visible because of the few seed points and the correspondences not found
In fig.6.4.6 and 6.4.7 other triplets are shown. In this sequence the tracked model was moving only his arms; in the first experiments only 14 points were selected as seed points. Fig.6.4.6: Triplets of a sequence: template image with 16 seed points (upper row). Central template image and search images at the borders with the computed 2D clouds of correspondences (lower row)
But big holes occurred because of unmatched points (high sigma 0) in regions of uniform texture. It was necessary to add two more points on the torso of the man in order to extract the whole body (fig.6.4.6, 6.4.7). Also here the matching algorithm failed in regions with low contrast or homogeneous texture (fig.6.4.7, 6.4.8), as homologous points cannot be assigned reliably, or corresponding points cannot be found at all in the images. Fig.6.4.7: Object extraction from the triplet of images. In order: first search image (frame t-1), template image (frame t), second search image (frame t+1) Fig.6.4.8: Central template image of the next three triplets with the found correspondences.
7. Conclusions

An overview of some methods for the detection and tracking of human movements in image space has been presented. Two algorithms that track points in image sequences have been used: the first is based on classic photogrammetric least square matching, the other on a model of affine image changes proposed by Shi, Tomasi and Kanade and available on the net. Both algorithms were tested on different sequences and the best results came from the LSM tracking. This algorithm can work with longer sequences, with better precision, and is more reliable than the other; moreover the LSM tracker can also work with low texture images and, if no occlusions occur, no big outliers are present. On the other hand, the STK algorithm needs very good texture around the points to track, and an efficient outlier rejection scheme as well; it is a very good tracker for indoor sequences full of features (corners) with high texture, but it is very time consuming.
The object tracking algorithm produced nice results when the images had good, non-uniform texture and the seed points were well spread over the object to measure; in low resolution images many holes occurred in the results. It can be considered a process for object extraction based on tracked points and image matching.
The detection algorithm is an automatic process to determine the bounding boxes of moving people in a sequence of frames; it is a very simple implementation but it can work with long sequences and cope with occlusions. The precision of the boxes depends on the projections of the pixels and their slices; therefore the choice of the threshold value used to compute the image difference was very important.

8. Future works

1. The LSM tracker must be improved in outlier rejection. The cross-correlation process should be integrated in the main algorithm to reject mismatched points in real time rather than in post-processing.
2.
A more accurate and refined process to detect and track objects in case of occlusions should be added: occlusions could be predicted and handled with more sophisticated algorithms, while foreground extraction could be performed with a better background subtraction technique.
3. A camera model could be defined to reconstruct the 3D world from the image coordinates extracted with the tracking process.
4. The object tracking algorithm could be improved by adding neighborhood information in the matching process, to close the gaps that occurred in the results.
More informationComputer Vision 2. SS 18 Dr. Benjamin Guthier Professur für Bildverarbeitung. Computer Vision 2 Dr. Benjamin Guthier
Computer Vision 2 SS 18 Dr. Benjamin Guthier Professur für Bildverarbeitung Computer Vision 2 Dr. Benjamin Guthier 1. IMAGE PROCESSING Computer Vision 2 Dr. Benjamin Guthier Content of this Chapter Non-linear
More informationEE795: Computer Vision and Intelligent Systems
EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 11 140311 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Motion Analysis Motivation Differential Motion Optical
More informationColour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation
ÖGAI Journal 24/1 11 Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation Michael Bleyer, Margrit Gelautz, Christoph Rhemann Vienna University of Technology
More informationSUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS
SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS Cognitive Robotics Original: David G. Lowe, 004 Summary: Coen van Leeuwen, s1460919 Abstract: This article presents a method to extract
More informationLow Cost Motion Capture
Low Cost Motion Capture R. Budiman M. Bennamoun D.Q. Huynh School of Computer Science and Software Engineering The University of Western Australia Crawley WA 6009 AUSTRALIA Email: budimr01@tartarus.uwa.edu.au,
More information1 (5 max) 2 (10 max) 3 (20 max) 4 (30 max) 5 (10 max) 6 (15 extra max) total (75 max + 15 extra)
Mierm Exam CS223b Stanford CS223b Computer Vision, Winter 2004 Feb. 18, 2004 Full Name: Email: This exam has 7 pages. Make sure your exam is not missing any sheets, and write your name on every page. The
More informationDynamic Time Warping for Binocular Hand Tracking and Reconstruction
Dynamic Time Warping for Binocular Hand Tracking and Reconstruction Javier Romero, Danica Kragic Ville Kyrki Antonis Argyros CAS-CVAP-CSC Dept. of Information Technology Institute of Computer Science KTH,
More informationComputer Vision I. Announcements. Fourier Tansform. Efficient Implementation. Edge and Corner Detection. CSE252A Lecture 13.
Announcements Edge and Corner Detection HW3 assigned CSE252A Lecture 13 Efficient Implementation Both, the Box filter and the Gaussian filter are separable: First convolve each row of input image I with
More informationVisual Tracking (1) Tracking of Feature Points and Planar Rigid Objects
Intelligent Control Systems Visual Tracking (1) Tracking of Feature Points and Planar Rigid Objects Shingo Kagami Graduate School of Information Sciences, Tohoku University swk(at)ic.is.tohoku.ac.jp http://www.ic.is.tohoku.ac.jp/ja/swk/
More informationAnno accademico 2006/2007. Davide Migliore
Robotica Anno accademico 6/7 Davide Migliore migliore@elet.polimi.it Today What is a feature? Some useful information The world of features: Detectors Edges detection Corners/Points detection Descriptors?!?!?
More informationStructure from Motion. Prof. Marco Marcon
Structure from Motion Prof. Marco Marcon Summing-up 2 Stereo is the most powerful clue for determining the structure of a scene Another important clue is the relative motion between the scene and (mono)
More informationMulti-stable Perception. Necker Cube
Multi-stable Perception Necker Cube Spinning dancer illusion, Nobuyuki Kayahara Multiple view geometry Stereo vision Epipolar geometry Lowe Hartley and Zisserman Depth map extraction Essential matrix
More informationRuch (Motion) Rozpoznawanie Obrazów Krzysztof Krawiec Instytut Informatyki, Politechnika Poznańska. Krzysztof Krawiec IDSS
Ruch (Motion) Rozpoznawanie Obrazów Krzysztof Krawiec Instytut Informatyki, Politechnika Poznańska 1 Krzysztof Krawiec IDSS 2 The importance of visual motion Adds entirely new (temporal) dimension to visual
More informationMassachusetts Institute of Technology Department of Computer Science and Electrical Engineering 6.801/6.866 Machine Vision QUIZ II
Massachusetts Institute of Technology Department of Computer Science and Electrical Engineering 6.801/6.866 Machine Vision QUIZ II Handed out: 001 Nov. 30th Due on: 001 Dec. 10th Problem 1: (a (b Interior
More informationLocal Feature Detectors
Local Feature Detectors Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr Slides adapted from Cordelia Schmid and David Lowe, CVPR 2003 Tutorial, Matthew Brown,
More informationBiometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong)
Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) References: [1] http://homepages.inf.ed.ac.uk/rbf/hipr2/index.htm [2] http://www.cs.wisc.edu/~dyer/cs540/notes/vision.html
More informationCS664 Lecture #18: Motion
CS664 Lecture #18: Motion Announcements Most paper choices were fine Please be sure to email me for approval, if you haven t already This is intended to help you, especially with the final project Use
More informationA Vision System for Automatic State Determination of Grid Based Board Games
A Vision System for Automatic State Determination of Grid Based Board Games Michael Bryson Computer Science and Engineering, University of South Carolina, 29208 Abstract. Numerous programs have been written
More informationEE795: Computer Vision and Intelligent Systems
EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 WRI C225 Lecture 04 130131 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Histogram Equalization Image Filtering Linear
More informationComputer Vision Lecture 20
Computer Perceptual Vision and Sensory WS 16/17 Augmented Computing Computer Perceptual Vision and Sensory WS 16/17 Augmented Computing Computer Perceptual Vision and Sensory WS 16/17 Augmented Computing
More informationProduct information. Hi-Tech Electronics Pte Ltd
Product information Introduction TEMA Motion is the world leading software for advanced motion analysis. Starting with digital image sequences the operator uses TEMA Motion to track objects in images,
More informationComputer Vision Lecture 20
Computer Perceptual Vision and Sensory WS 16/76 Augmented Computing Many slides adapted from K. Grauman, S. Seitz, R. Szeliski, M. Pollefeys, S. Lazebnik Computer Vision Lecture 20 Motion and Optical Flow
More informationTowards the completion of assignment 1
Towards the completion of assignment 1 What to do for calibration What to do for point matching What to do for tracking What to do for GUI COMPSCI 773 Feature Point Detection Why study feature point detection?
More informationImage Processing Fundamentals. Nicolas Vazquez Principal Software Engineer National Instruments
Image Processing Fundamentals Nicolas Vazquez Principal Software Engineer National Instruments Agenda Objectives and Motivations Enhancing Images Checking for Presence Locating Parts Measuring Features
More informationOptical flow and tracking
EECS 442 Computer vision Optical flow and tracking Intro Optical flow and feature tracking Lucas-Kanade algorithm Motion segmentation Segments of this lectures are courtesy of Profs S. Lazebnik S. Seitz,
More informationCS4733 Class Notes, Computer Vision
CS4733 Class Notes, Computer Vision Sources for online computer vision tutorials and demos - http://www.dai.ed.ac.uk/hipr and Computer Vision resources online - http://www.dai.ed.ac.uk/cvonline Vision
More informationCOMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION
COMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION Mr.V.SRINIVASA RAO 1 Prof.A.SATYA KALYAN 2 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING PRASAD V POTLURI SIDDHARTHA
More informationIntroduction to behavior-recognition and object tracking
Introduction to behavior-recognition and object tracking Xuan Mo ipal Group Meeting April 22, 2011 Outline Motivation of Behavior-recognition Four general groups of behaviors Core technologies Future direction
More informationAutomatic Generation of Animatable 3D Personalized Model Based on Multi-view Images
Automatic Generation of Animatable 3D Personalized Model Based on Multi-view Images Seong-Jae Lim, Ho-Won Kim, Jin Sung Choi CG Team, Contents Division ETRI Daejeon, South Korea sjlim@etri.re.kr Bon-Ki
More informationVisual motion. Many slides adapted from S. Seitz, R. Szeliski, M. Pollefeys
Visual motion Man slides adapted from S. Seitz, R. Szeliski, M. Pollefes Motion and perceptual organization Sometimes, motion is the onl cue Motion and perceptual organization Sometimes, motion is the
More informationComputer Vision I - Filtering and Feature detection
Computer Vision I - Filtering and Feature detection Carsten Rother 30/10/2015 Computer Vision I: Basics of Image Processing Roadmap: Basics of Digital Image Processing Computer Vision I: Basics of Image
More informationSIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014
SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014 SIFT SIFT: Scale Invariant Feature Transform; transform image
More informationModel-Based Human Motion Capture from Monocular Video Sequences
Model-Based Human Motion Capture from Monocular Video Sequences Jihun Park 1, Sangho Park 2, and J.K. Aggarwal 2 1 Department of Computer Engineering Hongik University Seoul, Korea jhpark@hongik.ac.kr
More informationIntroduction to Medical Imaging (5XSA0) Module 5
Introduction to Medical Imaging (5XSA0) Module 5 Segmentation Jungong Han, Dirk Farin, Sveta Zinger ( s.zinger@tue.nl ) 1 Outline Introduction Color Segmentation region-growing region-merging watershed
More informationFiltering Images. Contents
Image Processing and Data Visualization with MATLAB Filtering Images Hansrudi Noser June 8-9, 010 UZH, Multimedia and Robotics Summer School Noise Smoothing Filters Sigmoid Filters Gradient Filters Contents
More informationComplex Sensors: Cameras, Visual Sensing. The Robotics Primer (Ch. 9) ECE 497: Introduction to Mobile Robotics -Visual Sensors
Complex Sensors: Cameras, Visual Sensing The Robotics Primer (Ch. 9) Bring your laptop and robot everyday DO NOT unplug the network cables from the desktop computers or the walls Tuesday s Quiz is on Visual
More informationBIL Computer Vision Apr 16, 2014
BIL 719 - Computer Vision Apr 16, 2014 Binocular Stereo (cont d.), Structure from Motion Aykut Erdem Dept. of Computer Engineering Hacettepe University Slide credit: S. Lazebnik Basic stereo matching algorithm
More informationBasic relations between pixels (Chapter 2)
Basic relations between pixels (Chapter 2) Lecture 3 Basic Relationships Between Pixels Definitions: f(x,y): digital image Pixels: q, p (p,q f) A subset of pixels of f(x,y): S A typology of relations:
More informationLecture 4: Spatial Domain Transformations
# Lecture 4: Spatial Domain Transformations Saad J Bedros sbedros@umn.edu Reminder 2 nd Quiz on the manipulator Part is this Fri, April 7 205, :5 AM to :0 PM Open Book, Open Notes, Focus on the material
More informationDense 3-D Reconstruction of an Outdoor Scene by Hundreds-baseline Stereo Using a Hand-held Video Camera
Dense 3-D Reconstruction of an Outdoor Scene by Hundreds-baseline Stereo Using a Hand-held Video Camera Tomokazu Satoy, Masayuki Kanbaray, Naokazu Yokoyay and Haruo Takemuraz ygraduate School of Information
More informationReal-Time Scene Reconstruction. Remington Gong Benjamin Harris Iuri Prilepov
Real-Time Scene Reconstruction Remington Gong Benjamin Harris Iuri Prilepov June 10, 2010 Abstract This report discusses the implementation of a real-time system for scene reconstruction. Algorithms for
More informationEE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm
EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm Group 1: Mina A. Makar Stanford University mamakar@stanford.edu Abstract In this report, we investigate the application of the Scale-Invariant
More informationDense 3D Reconstruction. Christiano Gava
Dense 3D Reconstruction Christiano Gava christiano.gava@dfki.de Outline Previous lecture: structure and motion II Structure and motion loop Triangulation Today: dense 3D reconstruction The matching problem
More informationUsing temporal seeding to constrain the disparity search range in stereo matching
Using temporal seeding to constrain the disparity search range in stereo matching Thulani Ndhlovu Mobile Intelligent Autonomous Systems CSIR South Africa Email: tndhlovu@csir.co.za Fred Nicolls Department
More informationCS 4495 Computer Vision A. Bobick. Motion and Optic Flow. Stereo Matching
Stereo Matching Fundamental matrix Let p be a point in left image, p in right image l l Epipolar relation p maps to epipolar line l p maps to epipolar line l p p Epipolar mapping described by a 3x3 matrix
More informationCOMPUTER AND ROBOT VISION
VOLUME COMPUTER AND ROBOT VISION Robert M. Haralick University of Washington Linda G. Shapiro University of Washington T V ADDISON-WESLEY PUBLISHING COMPANY Reading, Massachusetts Menlo Park, California
More information