Individual Research Grant. Research Grant Application no. 1492/11. General application information


General application information

Role: PI.1
Name: Moses Yael
Academic Appointment: Senior Lecturer
Department: Computer Science
Institute: Interdisciplinary Center (IDC) Herzliya

Research Title: Early Integration of Multi-Camera Information
Keywords: multi-camera, tracking, 3D reconstruction, motion recovery, change detection
Requested Budget in US Dollars ($1 = 3.9 NIS): No. of Years: 4; Average Annual Budget: $61,628
Institute Authorization: Name & Position / Signature & Stamp / Date

Abstract: Early integration of multi-camera information

A set of cameras that views the same region from different viewpoints provides more information about the scene than a single video. A multi-camera system should thus, in principle, allow better solutions to challenging computer vision problems. Most past effort in computer vision research has focused on analyzing data taken by a single camera (a still image or a video) or a stereo pair. Exceptions are methods developed for large sets of static images. While the state-of-the-art results for single cameras are quite impressive, they are still limited by the number of views and by the ambiguities that arise from the ill-posedness of many computer vision problems. Hence, new methods should be developed in order to capitalize on the information available from a multi-camera setting with overlapping fields of view.

Most existing multi-camera techniques consist of two phases. The first phase applies successful existing single-image or single-video methods. The results obtained for the different views are then integrated in a second phase of the computation in order to improve robustness and reduce ambiguity. In tracking, for example, objects in a multi-camera setting are tracked first in each camera, and only then are the results integrated to reduce ambiguity. The performance of approaches that integrate the information in a second phase, as in the tracking example, is often limited by the techniques used in the first phase of the computation. Similarly, when sequences are available from a set of cameras, the temporal information (e.g., optical flow) and the spatial information (e.g., disparity) are often integrated only as a second step. Again, the limitations are similar to those of the state-of-the-art methods for shape reconstruction from still images or those for computing optical flow. In a sense, such late integration of the data may be regarded as a sum of parts.

The motivation for the proposed research is to develop novel methods that better utilize the data from a set of cameras, reducing its inherent ambiguity by integrating it at an early stage of the computation. To this end, we will focus on three tasks: change detection in a multi-camera setting, recovering scene flow, and tracking people in a dense crowd. We expect that early integration will lead, overall, to improvements over existing solutions to these problems, because the whole is greater than the sum of its parts. We hope that assembling the set of solutions for the three tasks will give better insight into the data integration problem, and assist us in developing general techniques and building blocks for integrating information from multi-camera sequences.

Detailed description of the research program

Early integration of multi-camera information

1 Scientific Background

1.1 Multi-camera settings

The use of multi-camera systems for solving challenging problems in computer vision has been an emerging field of research in the last decade. The evolution of multi-camera methods can be traced back to the early 70s, when additional cameras were used to overcome ill-posed problems as well as to obtain robust results. The classical solution to 3D reconstruction from two cameras (e.g., Marr & Poggio [17]) is the first example. The use of three cameras was proposed for further reducing the remaining ambiguity of the correspondence problem (e.g., [11]). Large numbers of static images were used by structure-from-motion algorithms for sparse 3D scene recovery (the Photo Tourism system [26] is an impressive example). Adding temporal information to multi-camera data by using multiple sequences clearly provides even more information about the scene than any of the alternatives (e.g., scene flow recovery [29]). Decreasing costs, together with increasing computational power and bandwidth, are expected to make multi-camera systems feasible as a basic platform for scene interpretation. The widespread use of surveillance cameras in cities and stores for civilian monitoring and security, as well as the integration of cameras in personal gadgets such as cell phones, necessitates extending computer vision research to utilize the information available from multi-camera systems.

A central challenge in multi-camera systems is how to efficiently integrate the useful information from the set of sequences. Most existing techniques rely strongly on applying successful existing single-camera methods. The obtained results are then integrated across the set of videos in a second phase of the computation. Examples of such methods for specific applications are listed below. Our objective is to develop novel methods for solving challenging computer vision tasks through early integration of the data from a set of videos. The tasks we intend to consider include recovery of 3D motion and structure, tracking, and change detection. We next present some brief background on these tasks and review the integration level used in each application.

1.2 Change detection

Change detection is a basic and well-studied problem in computer vision. It is often used as a first step in surveillance applications, remote sensing, and tracking, and as a focusing mechanism. There are numerous approaches for solving this problem, most based on a single camera (see the reviews in [25, 3]). Exceptions are the methods proposed by [12, 18], where a multi-camera setting is used. These methods solve change detection in each of the cameras and integrate the results in a second phase by aligning the floor images using a homography transformation, measuring changes only on the floor plane.

Several main challenges are addressed by change detection methods. These include handling illumination changes and shadows, low signal-to-noise ratio, and a moving background (e.g., moving leaves). A challenge outside the scope of existing methods is the presence of reflectors such as windows, mirrors, or puddles in the scene. The background, in this case, may be any image reflected from these surfaces and hence cannot be learned. We propose to develop new methods for computing change detection using a set of cameras with overlapping fields of view. The novel change detection methods are intended to handle reflecting surfaces and to reduce the signal-to-noise ratio required for detection. Our methods will rely on early integration of the data from the set of cameras, and the results will be compared to a naïve integration method.

1.3 Tracking

Object tracking is an important component of many applications such as missing-person search (e.g., a lost child), surveillance, and behavior analysis. Tracking people or objects can be a very challenging task due to occlusions, changes in illumination, and ambiguities in the appearance of moving objects. Although the single-camera state of the art for tracking people is very impressive (e.g., [14]), it is still limited to a relatively low density of people. Existing single-camera tracking in a very dense crowd makes strong assumptions on the expected motion (direction and speed) of the objects (e.g., [1]). To overcome these limitations, multi-camera tracking algorithms were developed. Most perform independent tracking in each of the sequences and then combine the results (e.g., [5, 7, 20, 13]). Methods that try to avoid this approach are based on aligning the floor of the set of images using a homography (e.g., [12]). Our recent study on tracking people in a dense crowd [6] demonstrates the benefit of simultaneously integrating the spatial information from the set of sequences in order to define a set of features to be tracked (the centers of the head-tops are detected). The early integration of information was performed on static frames from the set of cameras; the temporal information was used in a second phase. This method can be regarded as an initial framework for solving multi-camera tracking. We propose to develop a new method that will integrate both the spatial information and the temporal information at an early stage of the computation (see Section 3.3).

1.4 Scene flow and 3D scene reconstruction

The raw representation of a general moving scene consists of a dense 3D shape and the scene flow (the dense 3D motion field of a nonrigid 3D scene). Various methods were therefore developed in the last decade to compute this raw data. Due to the inherent ambiguity of the problem, multi-camera sequences were used as input. In this case, the relevant information available from a single camera is the optical flow (the projection of the scene flow). The set of static frames taken from different views provides the structure. In general, given reliable solutions for both stereo and optical flow, the scene flow and the 3D structure can be directly solved. There are numerous methods for solving the optical flow problem as well as for solving stereo or multi-view stereo. Indeed, most previous approaches for recovering both scene flow and 3D structure decouple the two problems (e.g., [28, 29, 33, 34, 4, 23, 31]).
Computing optical flow and stereo independently may give inaccurate results due to errors in each of the algorithms and inconsistency between their solutions. When the scene flow and 3D structure are decoupled, the two problems are solved sequentially. As a result of this second-phase integration, the spatio-temporal information is not fully utilized. In Vedula et al. [29], for example, the optical flow field is computed independently for each camera without imposing consistency between the flow fields. Another example of the limitations of decoupling is the study by Wedel et al. [31], where consistency is enforced on the stereo and motion solutions. However, the disparity map of the first time step is computed separately, and the results are still sensitive to its errors. Previous approaches for simultaneous recovery of scene flow and 3D structure help overcome these limitations (e.g., [30, 10, 19, 9, 21]). However, most of these methods, as well as others that decouple the structure and motion recovery [33, 34, 29, 10, 19, 9, 31, 15], rely on 2D parametrization (e.g., disparity and optical flow). Hence, they suffer from the limitations of 2D parametrization: they are either applicable to only two views, or each additional view adds an additional set of unknowns to the problem.

In a recent study [2], we proposed a novel method for recovering the 3D structure and scene flow from a pair of sequences taken at two time steps. In an attempt to simultaneously integrate the data from all frames, we proposed a 3D point cloud parametrization of the 3D structure and scene flow, allowing us to directly estimate the desired unknowns. A unified global energy functional was proposed to incorporate the information from the available sequences and simultaneously recover both depth and scene flow. The results of our method show improvement over existing state-of-the-art results. We expand on this study in Section 3.2. This method is regarded as an initial framework for solving the scene flow problem, which we intend to develop further in the proposed research.

2 Research Objectives and Expected Significance

The ambiguity inherent in single-camera data for essential computer vision tasks can be significantly reduced by using data from a set of video sequences. The main thesis underlying this proposal is that a genuine advantage can be gained by early integration of data from a multi-camera system with overlapping fields of view. We propose to establish this thesis by developing early integration methods for three basic tasks in computer vision: change detection, tracking, and reconstruction of scene flow and 3D structure. While these tasks are sufficiently distinct to require different techniques, we intend to demonstrate (and our initial work has shown) that in all cases, early integration of multi-camera data is beneficial. This will strongly suggest that early integration is an important component to be considered in multi-camera computer vision.

Of independent value are the solutions that we plan to provide for three basic computer vision tasks in the challenging multi-camera setting. Since multi-camera systems are currently available, and the state of the art is still considerably short of providing ample techniques for dealing with them, developing such effective techniques is imperative. In particular, the proposed method will allow tracking in extremely challenging conditions that include high crowd density and significant variation in illumination. The change detection algorithm we intend to develop will allow changes to be detected under extremely difficult conditions, with low signal-to-noise ratio, in scenes that contain reflecting elements.
Finally, the novel 3D parameterization used for recovering the 3D scene and its 3D motion will set the foundation for robust and efficient recovery of the raw data of a scene, which can then serve many higher-level computer vision and computer graphics applications.

3 Detailed Description of the Proposed Research

3.1 Setting up a multi-camera system

In order to efficiently integrate the information from the set of cameras, geometric calibration, synchronization, and often also photometric calibration must be considered. The geometric calibration of a set of cameras with overlapping fields of view is a well-studied problem, and solutions are obtained by means of calibration devices or using state-of-the-art automatic calibration methods (e.g., [8]). Synchronization between videos is also necessary for correlating video data from different cameras. A commonly used calibration device (e.g., a chessboard calibration pattern) cannot be used when the cameras view a physically large part of the scene. The synchronization of the cameras can also be solved using a dedicated visual device. Alternatively, synchronization can be computed and maintained using an online algorithm recently developed in our lab [24]. A calibration and synchronization device developed in our lab consists of a set of five poles, each carrying three blinking LEDs at fixed heights. Each LED blinks at a different frequency. The differences in frequency allow us to easily compute the correspondence between the LEDs in the different images, as well as to synchronize the videos.

3.2 3D parameterization for structure and motion recovery

A first step in integrating the information from a set of cameras to recover the 3D structure and scene flow was proposed in our recent paper [2]. Here we highlight the important components of this method, and then we elaborate on the proposed research for solving the problem of full recovery of the scene flow and structure from a set of multi-camera sequences. The input to our method was a set of two successive frames from each of n sequences. The information from these frames was integrated by using a 3D point cloud parametrization of the 3D structure and scene flow. This parameterization allows us to directly estimate the desired unknowns, and it naturally integrates the information from all frames. A unified global energy functional incorporates the information from the available sequences and simultaneously recovers both depth and scene flow. This approach is in contrast to most previous methods, which decouple the motion (optical flow) and structure (disparity) into two separate problems and solve the entire shape and motion recovery problem in two phases (see the discussion in Section 1.4).

The advantage of using a 3D rather than a 2D parametrization is that it allows primary assumptions to be imposed on the unknowns prior to their projection. For example, a constant 3D motion field of a scene may project to a discontinuous 2D field due to discontinuities in depth. Hence, in this example, smoothness assumptions hold for the 3D parametrization but not for the 2D one. In addition, the 3D parametrization allows direct extension to multiple views without changing the problem's dimension. That is, the number of unknowns remains minimal, regardless of the number of views. This is in contrast to 2D parameterization, where each additional view requires an additional set of parameters (e.g., disparity or optical flow maps). Our proposed parameterization and functional have never been used before, despite naturally capturing the problem. This is probably due to the challenges involved in using this approach. For example, combining our 3D representation in a multi-view variational framework results in a challenging nonconvex optimization problem.
Moreover, due to our 3D representation, the relation between the image coordinates and the unknowns is nonlinear (as opposed to optical flow or disparity). Consequently, the derivation of the associated Euler-Lagrange equations involves nontrivial computations. In addition, the use of multiple views requires that occlusions be properly handled, because each view adds more occluded regions. Obviously, the occlusion between the views becomes more severe when a wide-baseline rig is considered. (We expand on this issue below.) Our variational framework, which was used for the first time for multiple views and a 3D representation, successfully minimizes the resulting functional despite these difficulties. We believe that we can extend this framework to solve the general problem addressed here.

In the proposed research we suggest using the foundation set by [2] to develop a method for full scene flow and 3D reconstruction from a set of sequences. We next list the four directions that we propose to develop. All require further integration of additional data.

Full sequence: Using multi-view sequences for estimating the 3D structure and scene flow provides redundant information that can be utilized to improve the results over time. Currently, our method exploits the mutual spatio-temporal information from only two frames of each sequence. We intend to extend our method to a longer video in order to achieve a continuous solution of the 3D structure and scene flow over time. The integration of the information from the entire set of sequences is non-trivial, since each pair of time steps produces its own 3D structure and scene flow. Applying the method to a successive pair may result in a structure inconsistent with the previously computed one. Hence, great care should be taken when extending the method to the entire set of sequences.

Multiple reference views: In many applications, complete view-independent models of the 3D structure and motion are desirable. In our preliminary study, the proposed 3D parameterization results in a view-centric representation. That is, the structure and its 3D motion are estimated only in regions that are visible from the reference view. We intend to achieve a generalized view-independent model of the structure and scene flow. To do so, we intend to study a method that registers and integrates the outputs from multiple reference views and multiple time steps.

Occlusions: Another essential research direction addresses the problem of occlusions. When considering a multi-view system, correctly determining occluded regions can be crucial for the final results, since each view increases the prevalence of semi-occluded regions (regions that are visible in some but not all views). We plan to address two aspects of the occlusion problem: improving the detection of occluded regions and investigating how to properly handle these regions during the optimization process.

Performance: Last but not least, we plan to investigate the performance and scalability of our method toward a real-time implementation. Nowadays, parallel computing architectures such as multicore CPUs and GPUs are increasingly prevalent. In particular, parallel multigrid solvers are being developed, allowing substantial speedups for known iterative solvers such as Gauss-Seidel or SOR. Hence, we believe that our method can be significantly accelerated. Note that the number of unknowns in the current phase of our method is set to the minimum required to represent the problem, and it is independent of the number of cameras.
Moreover, GPUs are well suited to efficient convolution and warping of images, which are heavy, repetitive computations in our method.
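To make the contrast with 2D parameterizations concrete, the following minimal sketch (our own illustration, not the actual functional of [2]) shows the kind of multi-view data term a 3D point-cloud parameterization induces: each surface point carries one position and one 3D motion vector, and every additional view contributes only residuals, not unknowns. The function names, the nearest-neighbour sampling, and the assumption that all projections fall inside the images are ours.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X (shape (3,)) with a 3x4 camera matrix P into pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def photometric_residuals(X, V, cams, frames_t, frames_t1):
    """Brightness-constancy residuals for one surface point X moving by the 3D
    scene-flow vector V between time t and t+1, collected over all views.
    The unknowns (X, V) are shared by all cameras; each extra view adds only a residual."""
    res = []
    for P, I_t, I_t1 in zip(cams, frames_t, frames_t1):
        u0, v0 = np.round(project(P, X)).astype(int)      # projection at time t
        u1, v1 = np.round(project(P, X + V)).astype(int)  # projection at time t+1
        res.append(float(I_t[v0, u0]) - float(I_t1[v1, u1]))
    return np.array(res)
```

A full variational treatment would replace the nearest-neighbour lookup with differentiable interpolation and add the smoothness and occlusion terms discussed above.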

3.3 Multi-camera tracking: detailed description

Our previous study [6] suggests a method for tracking people in a dense crowd in the presence of severe occlusions and challenging illumination conditions (see Fig. 1). The tracking was based on detecting people by integrating the spatial information from a set of cameras at an early stage of the computation. The performance of the method exceeds that of the competing state of the art in dense-crowd tracking. In order to improve the method to handle even higher crowd densities, we will also consider early integration of temporal data at the detection stage.

Let us first review the relevant aspects of our previous study. A main challenge addressed by our previous method is that most of the body of a person walking in a dense crowd may be occluded. To overcome this challenge, the feature to track was defined to be the top of the head of each person. A set of cameras was placed at a high elevation from which the heads are almost always visible. Even under these conditions, head segmentation using a single image is challenging, since in a dense crowd people are often merged into large foreground blobs, and using body shape is problematic due to occlusions. To overcome this problem, our previous method combined information from the set of cameras (static, synchronized, and partially calibrated). The method relies on the assumption that the head is the highest region of the body. A head-top forms a 2D blob on the plane parallel to the floor at the person's height. The set of frames taken from different views at the same time step is used to detect such blobs. For each height, the foreground images from all views (each may be a blob containing many people) are transformed using a planar homography to align the projection of the plane at that height. In Fig. 2 we demonstrate this process on a scene with a single person. Using intensity correlation, the head-tops are detected. The integration of different heights, as well as the tracking of these features, is described in the paper.

We propose to combine, at an early stage of the computation, both the temporal and the spatial information from the set of sequences. In particular, we intend to integrate information from two sets of frames, taken at two successive time steps. The general idea is to align the frames for each expected human height, as in the previous study, and to correlate the optical flow in addition to the color correlation used in our previous study. The optical flow can be computed using the standard Lucas-Kanade [16] optical flow algorithm for each pair of transformed frames taken from the same camera. We expect the proposed method to improve head detection under challenging conditions, and to provide additional information on each detected head to be used by the tracking phase (the direction of the motion). The algorithm outline (a sketch of the first two steps follows this list):

1. Compute the optical flow maps for each camera using the Lucas-Kanade algorithm.
2. Align the optical flow maps computed for each camera with the reference camera, for various expected human heights (as in the previous study).
3. Correlate the optical flow maps to detect head-tops. (We expand on this stage below.)
4. Add this correlation as an indication of head-top location to the basic multi-camera head detector.
5. Use the directions of the optical flow as additional information for the multi-camera tracker.
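As a concrete reading of steps 1-2, the sketch below warps a camera's frame pair to the reference view using the homography induced by the plane at one candidate height, and then computes dense optical flow on the aligned pair. It is a minimal illustration under our own assumptions: OpenCV's Farneback routine stands in for Lucas-Kanade, the inputs are BGR frames, and the function name and arguments are hypothetical.

```python
import cv2

def aligned_flow(frame_t, frame_t1, H, out_size):
    """Warp both frames of one camera to the reference view via the homography H
    induced by the plane at a candidate head height, then compute dense optical
    flow between the aligned frames (steps 1-2 of the outline)."""
    a_t  = cv2.warpPerspective(frame_t,  H, out_size)   # out_size = (width, height)
    a_t1 = cv2.warpPerspective(frame_t1, H, out_size)
    g_t  = cv2.cvtColor(a_t,  cv2.COLOR_BGR2GRAY)
    g_t1 = cv2.cvtColor(a_t1, cv2.COLOR_BGR2GRAY)
    # Farneback dense flow as a stand-in for Lucas-Kanade; returns an (h, w, 2) field.
    return cv2.calcOpticalFlowFarneback(g_t, g_t1, None, 0.5, 3, 15, 3, 5, 1.2, 0)

# One aligned flow map per camera for a given height; step 3 measures their consistency:
# flows = [aligned_flow(prev[c], cur[c], H_height[c], (W, Ht)) for c in range(num_cameras)]
```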
Intuitively speaking, we expect the optical flow vectors at regions that correspond to the correct plane (a head at that height) to be similar across all views; other regions are expected to have inconsistent optical flow vectors. Homogeneous background regions are also expected to have similar motion (probably small motion) in all frames; hence, they should be eliminated from the computation.

An equivalent of the intensity variance of corresponding pixels, used in our original multi-camera head detector, should be defined for measuring the variation of the optical flow from a reference optical flow. That is, for each pixel of each map, corresponding to a given view, we want to measure the deviation of the aligned optical flow from a reference optical flow (step 3 of the algorithm described above). The first question is how to compute the reference optical flow vector. One choice is to take the average of the vectors from all views. It is not clear, in this case, how to correctly normalize the directions and magnitudes of the optical flow vectors. Hence, we take a different approach for computing the reference vector. We suggest a modification of the Lucas-Kanade optical flow for a set of aligned images, which we name multi-view optical flow computation. The modification is simple: instead of considering the w × w neighborhood of a pixel when computing its structure tensor, the computation is performed over the k corresponding w × w neighborhoods of the k aligned views. The computed map is then used as the reference. The result of the multi-view optical flow for a given height, together with the optical flow of each view, is presented in Figures 3 and 4.

The next step is to choose the measure for testing deviation from the computed multi-view optical flow. Candidates include direction, magnitude, or both. Figure 5 shows a color map of the result of using only direction correlation. It can be seen that the high correlation values, colored in red, are indeed in agreement with the head location (compare to Figure 4). We intend to consider additional correlation measures, such as the least-squares error of the optical flow normalized by the zero-motion error (the difference between two successive frames). Note that the density in the presented examples is not very high. We believe that combining optical flow with the intensity cues from the previous study will allow us to handle denser crowds than have been considered before, and to reduce false-positive head detections early, at the feature-detector level. (In our previous study, most of the false positives were removed only in the tracking phase.) We expect this method to result in more robust detection of head-tops, with a motion vector associated with each head-top location. The motion associated with each head location is expected to improve the tracking phase of the algorithm. Hence, the assembly of these components is expected to be more robust than our previous method, where early integration was used only in the spatial domain. If successful, it will demonstrate our thesis of the advantages of integrating as much data as possible at an early stage of the computation.

3.4 Multi-camera change detection

Our objective is to use the multi-camera setting to overcome challenging change detection cases. We assume a wide-baseline set of calibrated cameras. One of the main challenges in change detection is that the signal may be at the noise level of the background. Existing single-camera methods attempt to overcome this challenge by improving the background models. The approach that we intend to take is to utilize additional information obtained by other cameras that view the same scene. We assume that the image noise in different cameras is uncorrelated.
Specifically, we are interested in discriminating between signals and reflections. Like many existing approaches, ours is pixel-based, where the background value of a pixel is described by a probability density function, p(x|B), where x is the grey level or a color. For example, a Gaussian distribution is used by [32], or a Mixture of Gaussians, which is a more powerful representation of the background, by [27]. Naively speaking, the magnitude of the deviation of a measurement from the background distribution, compared against a threshold, is used to determine whether the measurement belongs to the background or the foreground. The decision can be more sophisticated when the neighborhood of a pixel is used, as well as additional knowledge about the scene. When two independent measurements of the same point are available, the thresholds can be set simultaneously. In particular, a naïve integration of the data available from two or more cameras will result in setting a single new threshold for each of the cameras. We intend to develop a formulation for threshold setting, in the spirit described below (Section 3.4.1), for a set of cameras. We expect the thresholds of a pair of cameras to be given as a joint function rather than a single value. Intuitively speaking, the amount of deviation of a measurement from the background in one image affects the amount of deviation required in the second image for determining a foreground pixel. A larger set of cameras will require a higher-order function.

The underlying assumption made above is that correspondence between pixels is available. Since we assume a wide baseline with strong reflections, we cannot assume that correspondence can be easily computed when a change occurs. Hence this method is applicable when the structure of the background is known and the foreground object is close to the ground. In this case, correspondence can be easily obtained. For example, for a planar background, a homography transformation can be applied in order to align the planes in a set of images. A threshold can then be tested simultaneously on corresponding pixels using the joint-threshold function.

We next suggest a modification of the above method for cases where correspondence is not available. We propose, in this case, a novel approach that integrates the measured differences between an interval and its background along an epipolar line. The basic idea is that even if the exact correspondence between pixels is unknown, the corresponding pixel is known to lie on a known epipolar line. Hence, the change at the corresponding pixel must occur on its epipolar line. Partial knowledge about the expected location of the 3D point can be used to set the relevant interval of the epipolar line. (A similar idea of integration along epipolar lines was used in our previous study on synchronizing a pair of cameras [24].) We intend to analyze this case in order to set the appropriate thresholds. In particular, one must ensure that noise integrated along an epipolar line does not reduce the performance with respect to a single camera rather than improve it.

The above approaches can also be applied to a background that contains reflectors, as long as the reflected images satisfy the assumptions made about a background. We next consider the case where the images contain background reflections of moving objects that are outside the region of interest. For example, a change in the background may be due to the reflection, in a window, of a moving object behind the camera. In the next section we present a formal analysis for the case of surfaces that contain reflectors, with possibly unrelated objects reflected by them. The analysis is performed for a single camera.
We intend to develop a similar formulation to set the thresholds for the proposed multi-camera change detection method.
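To illustrate the epipolar-interval idea described above, the following sketch accumulates background deviations in a second camera along the segment of the epipolar line that corresponds to a pixel flagged in the first camera. It is only an illustration of the concept under our own assumptions (a known fundamental matrix F, per-pixel background mean and standard deviation, and a simple z-score accumulation); the names are hypothetical and not part of the proposed formulation.

```python
import numpy as np

def epipolar_evidence(F, x1, img2, bg_mean2, bg_std2, u_interval, n_samples=50):
    """Accumulate background deviation in camera 2 along the epipolar line l = F @ x1
    of a pixel x1 flagged as changed in camera 1, restricted to the column interval
    u_interval = (u_min, u_max) implied by prior knowledge of the object's location."""
    a, b, c = F @ np.array([x1[0], x1[1], 1.0])          # line coefficients a*u + b*v + c = 0
    us = np.linspace(u_interval[0], u_interval[1], n_samples)
    vs = -(a * us + c) / b                                # assumes the line is not vertical
    score, used = 0.0, 0
    for u, v in zip(np.round(us).astype(int), np.round(vs).astype(int)):
        if 0 <= v < img2.shape[0] and 0 <= u < img2.shape[1]:
            score += abs(float(img2[v, u]) - bg_mean2[v, u]) / (bg_std2[v, u] + 1e-6)
            used += 1
    return score / max(used, 1)    # to be tested jointly with camera 1's own deviation
```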

3.4.1 Modeling background that contains reflectors

A background model in the presence of reflectors is assumed to be a mixture of static and reflection distributions. The intensity of a pixel might reflect a static background, S, or a reflection, R. (Note that the static background can also contain reflections of a static scene, but it is treated as static as long as it can be learned.) For simplicity of presentation, we model the static background using a single Gaussian distribution with parameters µ and σ:

P(x | S; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}.   (1)

Extension of the analysis to a Mixture of Gaussians is part of the proposed research. Without any additional information about the possible reflected scene or the foreground object, we model R, as well as a foreground object O, by a uniform distribution over the intensity range. Thus we have:

P(x | R) = \frac{1}{255}, \qquad P(x | O) = \frac{1}{255}.   (2)

Let P(S) be the probability of a static background and let 1 - P(S) be the probability of reflection. The distribution of the intensity of a background pixel, assuming the intensity value range is 0 ≤ x ≤ 255, is given by:

P(x | B; \mu, \sigma) = P(x | S; \mu, \sigma)\, P(S) + P(x | R)\,(1 - P(S)).   (3)

Given the distributions for object and background, we use the likelihood ratio test to decide whether a given pixel intensity x is foreground or background:

\Lambda(x) = \frac{P(x | B)}{P(x | O)} \gtrless c.   (4)

According to the Neyman-Pearson theorem [22], we may choose the threshold c so as to maximize the probability of detecting an object, P_d, subject to a predefined probability of false alarm, P_FA:

P(\Lambda(x) < c \,|\, B) = P_{FA}.   (5)

We may obtain the threshold c by substituting Equations 3 and 2 into 5:

\frac{P(x | S; \mu, \sigma)\, P(S) + P(x | R)\,(1 - P(S))}{P(x | O)} \gtrless c.   (6)

By simple manipulation we get

P(x | S; \mu, \sigma) \gtrless \frac{c + P(S) - 1}{255\, P(S)} = c_1.   (7)

Taking the log of both sides and performing some additional manipulation, we get that a pixel is classified as background when

\mu - c_2 \le x \le \mu + c_2.   (8)

Thus, using Eq. 5 and Eq. 8, we get the following constraint:

P_{FA} = 1 - \int_{\mu - c_2}^{\mu + c_2} \left[ P(x | S; \mu, \sigma)\, P(S) + \frac{1 - P(S)}{255} \right] dx,   (9)

and then finally

P_{FA} = 1 - P(S)\left(F_N(\mu + c_2; \mu, \sigma) - F_N(\mu - c_2; \mu, \sigma)\right) - (1 - P(S))\,\frac{2 c_2}{255},   (10)

where F_N is the cumulative distribution function of a normal distribution parameterized by µ and σ. Given P_FA, µ, and σ for a given pixel, c_2 can be obtained from Eq. 10, and then c can be obtained as well. Figure 6 depicts the relationship between P_FA and c_2 for a given µ and σ. As can be seen, for a P_FA of 0.01 we need a threshold of c_2 = 115. Assuming that P(S) = 0.9, the grey-level range 13 ≤ x ≤ 243 will be classified as background, and only the complementary ranges 0 ≤ x < 13 and 243 < x ≤ 255 will be classified as object. This range is considerably small and results in very poor performance: the probability of detection is only 0.1. It is important to note that if we take P(S) = 1, that is, no reflectors in the scene, the same derivation results in the following classification rule: pixels in the range 115 ≤ x ≤ 141 are classified as background, and pixels in the ranges 0 ≤ x < 115 and 141 < x ≤ 255 are classified as object. Thus, for this case, the probability of detection would be 0.9.

To conclude, when the simple mechanism of background subtraction is applied to cases where reflective surfaces occur in the background, detection performance decreases drastically. This is quite obvious, in that the very essence of background subtraction is to identify regions that differ from the normal background, and reflections of new objects meet the same criteria as true objects. Thus, the only way (using a single camera) to decrease the number of false alarms induced by these reflections is post-processing with additional prior knowledge. In contrast, we believe that our multi-camera approach will result in robust change detection despite the presence of reflective objects in the scene. To realize our approach, we intend to develop a framework that uses a wide-baseline stereo configuration to create dual-view change detection. We expect this method to handle reflective surfaces very well, without the need for complex post-processing or additional prior knowledge. Because we intend to consider small reflector surfaces, P(S) is expected to be independent in the two cameras.

3.5 Preliminary Results

This proposal partially overlaps with a proposal that was submitted last year. The main area of overlap is the novel method for 3D recovery of scene flow and structure. In the meantime, the preliminary results presented in last year's proposal were accepted and presented at CVPR 2010 [2]. We also regard our paper on multi-camera tracking [6] as a starting point for the proposed tracking method. Other initial results were described above for each of the three tasks.

3.6 Working Environment

The work at IDC will be carried out at the Computer Science Lab in the Arazi-Ofer building. The lab has 9 USB cameras (IDS uEye UI-1545LE-C). The cameras are connected to three old Intel Core Duo 1.7 GHz laptops, which we would like to replace. These cameras can be used for running the experiments. In addition, we have in our lab a prototype of the calibration and synchronization poles.

Figure 1: Tracking results using combined data from all views, overlaid on the reference frame. (Figure taken from [6].)

Figure 2: 2D patch detection demonstrated, for clarity, on a single, unoccluded person. (The second person, at the back of the image, is 8 cm shorter; therefore the top of his head does not lie on the plane and is thus not detected at this height.) (a,b) Two views of a single scene taken at the same time. (c) Applying a homography transformation to image (b) to align points on the 3D plane at the head-top height with their counterparts in image (a). (d) Image (c) overlaid on image (a). (e) Overlay of additional transformed images. (f) Variance map of the hyper-pixels of image (e), color coded such that red corresponds to low variance. (Figure taken from [6].)
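As a small companion to Fig. 2(f), the sketch below computes such a variance map from a stack of views already warped to the plane at a candidate height. It is our own minimal illustration, with a hypothetical function name, of the intensity-correlation cue used by the detector in [6].

```python
import numpy as np

def hyperpixel_variance(aligned_views):
    """Per-pixel intensity variance across views warped to the plane at one
    candidate height (cf. Fig. 2(e)-(f)); low variance marks pixels that are
    consistent across views, i.e., candidate head-tops at that height."""
    stack = np.stack([v.astype(np.float32) for v in aligned_views])  # (k, h, w)
    return stack.var(axis=0)
```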

Figure 3: Optical flow vectors from each view (red), and combined from all views (yellow). Zoom-in.

Figure 4: Optical flow vectors from each view (red), and combined from all views (yellow).

Figure 5: Average cosine of the angle between the multi-view optical flow and each single-view optical flow. Red indicates high correlation.
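For completeness, a minimal sketch (our own, with a hypothetical function name) of the direction-correlation map shown in Figure 5, assuming the per-view flow fields are already aligned to the reference view and a reference (multi-view) flow field is given:

```python
import numpy as np

def mean_direction_cosine(ref_flow, view_flows, eps=1e-6):
    """Average cosine of the angle between a reference (multi-view) flow field and
    each single-view flow field, as in Fig. 5; values near 1 mark pixels whose
    motion is consistent across views at the tested height."""
    r = ref_flow / (np.linalg.norm(ref_flow, axis=-1, keepdims=True) + eps)
    cos_sum = np.zeros(ref_flow.shape[:2], dtype=np.float32)
    for f in view_flows:
        u = f / (np.linalg.norm(f, axis=-1, keepdims=True) + eps)
        cos_sum += (r * u).sum(axis=-1)
    return cos_sum / len(view_flows)
```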

Figure 6: P_FA as a function of the threshold c_2, assuming µ = 128 and σ = 5.
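The relationship Figure 6 depicts can be checked with a short script. This is our own numerical check of Eq. 10 under the stated assumptions (µ = 128, σ = 5, and P(S) = 0.9 as in the example of Section 3.4.1); it also recovers the c_2 values quoted there.

```python
import numpy as np
from scipy.stats import norm

def false_alarm_prob(c2, mu=128.0, sigma=5.0, p_s=0.9):
    """P_FA as a function of the threshold c2 (Eq. 10): the probability that a
    background pixel falls outside the range [mu - c2, mu + c2]."""
    static  = p_s * (norm.cdf(mu + c2, mu, sigma) - norm.cdf(mu - c2, mu, sigma))
    reflect = (1.0 - p_s) * 2.0 * c2 / 255.0
    return 1.0 - (static + reflect)

# Smallest c2 on a coarse grid reaching P_FA <= 0.01: about 115 with reflectors
# (P(S) = 0.9) and about 13 without them (P(S) = 1.0), matching Section 3.4.1.
grid = np.arange(0.0, 128.0, 0.5)
print(min(c for c in grid if false_alarm_prob(c) <= 0.01))            # ~115
print(min(c for c in grid if false_alarm_prob(c, p_s=1.0) <= 0.01))   # ~13
```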

References

[1] S. Ali and M. Shah. Floor fields for tracking in high density crowd scenes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 1-14.
[2] T. Basha, Y. Moses, and N. Kiryati. Multi-view scene flow estimation: A view centered variational approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[3] T. Bouwmans, F. El Baf, and B. Vachon. Background modeling using Mixture of Gaussians for foreground detection: a survey. Recent Patents on Computer Science, 1(3).
[4] R.L. Carceroni and K.N. Kutulakos. Multi-view scene capture by surfel sampling: From video streams to non-rigid 3D motion, shape and reflectance. International Journal of Computer Vision (IJCV), 49(2).
[5] W. Du and J. H. Piater. Multi-camera people tracking by collaborative particle filters and principal axis-based integration. In Proceedings of the Asian Conference on Computer Vision (ACCV).
[6] R. Eshel and Y. Moses. Tracking in a dense crowd using multiple cameras. International Journal of Computer Vision (IJCV), 88(1):1-15, 2010.
[7] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua. Multi-camera people tracking with a probabilistic occupancy map. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI).
[8] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press.
[9] F. Huguet and F. Devernay. A variational method for scene flow estimation from stereo sequences. In Proceedings of the International Conference on Computer Vision (ICCV).
[10] M. Isard and J. MacCormick. Dense motion and disparity estimation via loopy belief propagation. In Proceedings of the Asian Conference on Computer Vision (ACCV), 3852:32.
[11] A. Ishii and M. Ito. Range and shape measurement using three-view stereo analysis. In CVPR, pages 9-14.
[12] S.M. Khan and M. Shah. A multiview approach to tracking people in crowded scenes using a planar homography constraint. In Proceedings of the European Conference on Computer Vision (ECCV).
[13] J. Krumm, S. Harris, B. Meyers, B. Brumitt, M. Hale, and S. Shafer. Multi-camera multi-person tracking for EasyLiving. In International Workshop on Visual Surveillance.
[14] C.H. Kuo, C. Huang, and R. Nevatia. Inter-camera association of multi-target tracks by on-line learned appearance affinity models. In Proceedings of the European Conference on Computer Vision (ECCV), 2010.

[15] R. Li and S. Sclaroff. Multi-scale 3D scene flow from binocular stereo sequences. Computer Vision and Image Understanding (CVIU), 110(1):75-90.
[16] B.D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence (IJCAI), volume 3.
[17] D. Marr and T. Poggio. A cooperative computation of stereo disparity. Science, 194.
[18] R. Miezianko and D. Pokrajac. Multi-layer background change detection based on spatiotemporal texture projections. In Computer Vision and Graphics.
[19] D. Min and K. Sohn. Edge-preserving simultaneous joint motion-disparity estimation. In Proceedings of the International Conference on Pattern Recognition (ICPR), volume 2.
[20] A. Mittal and L. Davis. Unified multi-camera detection and tracking using region matching. In Proceedings of the IEEE Workshop on Multi-Object Tracking.
[21] J. Neumann and Y. Aloimonos. Spatio-temporal stereo using multi-resolution subdivision surfaces. International Journal of Computer Vision (IJCV), 47(1).
[22] J. Neyman and E.S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A, Containing Papers of a Mathematical or Physical Character, 231:289.
[23] J.P. Pons, R. Keriven, and O. Faugeras. Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score. International Journal of Computer Vision (IJCV), 72(2).
[24] D. Pundik and Y. Moses. Video synchronization using temporal signals from epipolar lines. In Proceedings of the European Conference on Computer Vision (ECCV).
[25] R.J. Radke, S. Andra, O. Al-Kofahi, and B. Roysam. Image change detection algorithms: a systematic survey. IEEE Transactions on Image Processing, 14(3).
[26] N. Snavely, S.M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. In SIGGRAPH, page 846.
[27] C. Stauffer and W.E.L. Grimson. Adaptive background mixture models for real-time tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2.
[28] S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade. Three-dimensional scene flow. In ICCV.
[29] S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade. Three-dimensional scene flow. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2005.

[30] S. Vedula, S. Baker, S. Seitz, and T. Kanade. Shape and motion carving in 6D. In CVPR, volume 2.
[31] A. Wedel, C. Rabe, T. Vaudrey, T. Brox, U. Franke, and D. Cremers. Efficient dense scene flow from sparse or dense stereo data. In Proceedings of the European Conference on Computer Vision (ECCV).
[32] C.R. Wren, A. Azarbayejani, T. Darrell, and A.P. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 19(7).
[33] Y. Zhang and C. Kambhamettu. Integrated 3D scene flow and structure recovery from multi-view image sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2.
[34] Y. Zhang and C. Kambhamettu. On 3D scene flow and structure estimation. In CVPR, 2001.

Time schedule and work-plan

Objective                                              Beginning   End
A. Scene flow: long videos + parallel solution         Oct 2011    May 2013
B. Scene flow: view-independent model + occlusions     May 2013    Sept 2015
C. Change detection: 2 static cameras                  Oct 2011    May 2013
D. Change detection: multiple moving cameras           Oct 2013    Sept 2015
E. Tracking                                            Oct 2011    May 2015

Explanatory Notes: The time schedule detailed on this page is approximate, since progress depends on scientific details that have not yet been fully explored.

A. Scene flow: the extension to long videos. In this stage we will also need to improve the performance of our method for scalability. See Section 3.2.
B. Scene flow: view-independent model + occlusions. In this part of the research we intend to consider additional reference views and to study how to integrate them into a view-independent model. This will also allow improving the treatment of occlusions. See Section 3.2.
C. Change detection: two static cameras. We intend to begin with a formal analysis and then test the results on real data. See Section 3.4.
D. Change detection: multiple moving cameras. We intend to extend the results obtained in C to more than two views and to pan/tilt cameras (with online calibration). See Section 3.4.
E. Tracking: the integration of optical flow into the head detection model and of the motion vector into the tracking phase. See Section 3.3.

Budget details

A. Personnel
Name (last, first)   Role in project   % time devoted   Salaries (in $): 1st / 2nd / 3rd / 4th year
Yael Moses           PI
TBA                  Ph.D. student                      28,000 / 28,000 / 28,000 / 28,000
TBA                  M.Sc. student                      20,000 / 20,000 / 20,000 / 20,000
Total Personnel                                         48,000 / 48,000 / 48,000 / 48,000

B. Supplies, Materials & Services
Item                                   Requested sums (in $): 1st / 2nd / 3rd / 4th year
Strong computer                        3,000 / 0 / 0 / 0
Disk, memory, etc.                     1,000 / 1,000 / 1,000 / 1,000
Student travel expenses                1,000 / 1,000 / 1,000 / 1,000
Total supplies, materials & services   5,000 / 2,000 / 2,000 / 2,000

C. Miscellaneous
Items: photocopies and office supplies; publication charges in scientific journals; professional literature; Internet connection; memberships in scientific associations.
Total miscellaneous (in $): 1,100 / 1,100 / 1,100 / 1,100

D. Equipment
Item: 3 strong laptops. Price (in $): 8,000
Total price ($): 8,000
Other expenses (including shipping, installation, customs and taxes): 0
Total: 8,000
Funds requested from ISF: 8,000

Justification for requested equipment: We have 9 USB cameras in our lab that will be used in the proposed research. The cameras must be connected to laptops in order to shoot the videos. Each laptop can record the data from three cameras; hence we need 3 laptops.

Budget Summary

Requested sums (in $)                                      1st year   2nd year   3rd year   4th year
Personnel, materials, supplies, services & miscellaneous   54,100     51,100     51,100     51,100
Overhead                                                   8,115      7,665      7,665      7,665
Total budget                                               62,215     58,765     58,765     58,765
Equipment (no overhead on this item)                       8,000
Annual average (including equipment)                       61,628

Budget Justification: The research will be performed by one Ph.D. student and one M.Sc. student in each year. In addition, the 3 laptops are required for shooting the videos with our USB cameras (the laptops we have are about 4 years old). A strong computer is required for running the experiments. Finally, we intend to publish our results in international conferences; we therefore ask for funding for student travel.

Curriculum Vitae

Name: Moses Yael

A. Academic Background
Date (from-to)   Institute            Degree   Area of specialization
                 Weizmann Institute   Ph.D.    Computer Science
                 Weizmann Institute   M.Sc.    Computer Science
                 Hebrew U.            B.Sc.    Mathematics and Computer Science

B. Previous Employment
Date (from-to)   Institute            Title                                 Research area
1999-current     IDC                  Senior Lecturer                       Computer Science
                 Weizmann Institute   Post-doctoral Fellow and Researcher   Computer Science
                 Oxford University    Post-doctoral Fellow                  Computer Science

C. Grants and Awards Received Within the Past Five Years
Research Topic                                                Funding Organization   Total (in $)   Comments
Shape reconstruction from a combination of geometr            ISF
Robust Distributed Vision                                     ISF                                   With Yoram Moses, Technion, Israel
Video Understanding, Learning Content and Noticati (VULCAN)   Magnet                                Magnet, Israel Ministry of Industry, Trade,

List of Publications: Yael Moses

Refereed Journal Papers:
* A1 Eshel R. and Moses Y. (2010), Tracking in a Dense Crowd using Multiple Cameras, International Journal of Computer Vision, 88(1), 1-15.
* A2 Moses Y. and Shimshoni I. (2009), 3D Shape Recovery of Smooth Surfaces: Dropping the Fixed Viewpoint Assumption, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI).
* A3 Avidan S., Moses Y. and Moses Y. (2007), Centralized and Distributed Multi-view Correspondence, International Journal of Computer Vision (IJCV), 71(1).
A4 Moiza G., Tal A., Shimshoni I., Barnett D. and Moses Y. (2003), Image-Based Animation of Facial Expressions, The Visual Computer (also published online, 2002).
A5 Shimshoni I., Moses Y., and Lindenbaum M. (2000), Shape Reconstruction of 3D Bilaterally Symmetric Surfaces, International Journal of Computer Vision, 39(2).
* A6 Basri R. and Moses Y. (1999), When is it Possible to Identify 3D Objects from Single Images Using Class Constraints?, International Journal of Computer Vision, 33.
A7 Moses Y. and Ullman S. (1998), Generalization to Novel Views: Universal, Class-based, and Model-based Processing, International Journal of Computer Vision, 29.
A8 Adini Y., Moses Y., and Ullman S. (1997), Face Recognition: the Problem of Compensating for Illumination Changes, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 19(7).


More information

Image Based Reconstruction II

Image Based Reconstruction II Image Based Reconstruction II Qixing Huang Feb. 2 th 2017 Slide Credit: Yasutaka Furukawa Image-Based Geometry Reconstruction Pipeline Last Lecture: Multi-View SFM Multi-View SFM This Lecture: Multi-View

More information

Robust Model-Free Tracking of Non-Rigid Shape. Abstract

Robust Model-Free Tracking of Non-Rigid Shape. Abstract Robust Model-Free Tracking of Non-Rigid Shape Lorenzo Torresani Stanford University ltorresa@cs.stanford.edu Christoph Bregler New York University chris.bregler@nyu.edu New York University CS TR2003-840

More information

Peripheral drift illusion

Peripheral drift illusion Peripheral drift illusion Does it work on other animals? Computer Vision Motion and Optical Flow Many slides adapted from J. Hays, S. Seitz, R. Szeliski, M. Pollefeys, K. Grauman and others Video A video

More information

Some books on linear algebra

Some books on linear algebra Some books on linear algebra Finite Dimensional Vector Spaces, Paul R. Halmos, 1947 Linear Algebra, Serge Lang, 2004 Linear Algebra and its Applications, Gilbert Strang, 1988 Matrix Computation, Gene H.

More information

CS 4495 Computer Vision Motion and Optic Flow

CS 4495 Computer Vision Motion and Optic Flow CS 4495 Computer Vision Aaron Bobick School of Interactive Computing Administrivia PS4 is out, due Sunday Oct 27 th. All relevant lectures posted Details about Problem Set: You may *not* use built in Harris

More information

Feature Tracking and Optical Flow

Feature Tracking and Optical Flow Feature Tracking and Optical Flow Prof. D. Stricker Doz. G. Bleser Many slides adapted from James Hays, Derek Hoeim, Lana Lazebnik, Silvio Saverse, who 1 in turn adapted slides from Steve Seitz, Rick Szeliski,

More information

Combining Appearance and Topology for Wide

Combining Appearance and Topology for Wide Combining Appearance and Topology for Wide Baseline Matching Dennis Tell and Stefan Carlsson Presented by: Josh Wills Image Point Correspondences Critical foundation for many vision applications 3-D reconstruction,

More information

Lecture 10: Multi-view geometry

Lecture 10: Multi-view geometry Lecture 10: Multi-view geometry Professor Stanford Vision Lab 1 What we will learn today? Review for stereo vision Correspondence problem (Problem Set 2 (Q3)) Active stereo vision systems Structure from

More information

Motion Analysis. Motion analysis. Now we will talk about. Differential Motion Analysis. Motion analysis. Difference Pictures

Motion Analysis. Motion analysis. Now we will talk about. Differential Motion Analysis. Motion analysis. Difference Pictures Now we will talk about Motion Analysis Motion analysis Motion analysis is dealing with three main groups of motionrelated problems: Motion detection Moving object detection and location. Derivation of

More information

Particle Tracking. For Bulk Material Handling Systems Using DEM Models. By: Jordan Pease

Particle Tracking. For Bulk Material Handling Systems Using DEM Models. By: Jordan Pease Particle Tracking For Bulk Material Handling Systems Using DEM Models By: Jordan Pease Introduction Motivation for project Particle Tracking Application to DEM models Experimental Results Future Work References

More information

Fundamentals of Stereo Vision Michael Bleyer LVA Stereo Vision

Fundamentals of Stereo Vision Michael Bleyer LVA Stereo Vision Fundamentals of Stereo Vision Michael Bleyer LVA Stereo Vision What Happened Last Time? Human 3D perception (3D cinema) Computational stereo Intuitive explanation of what is meant by disparity Stereo matching

More information

A Framework for Multiple Radar and Multiple 2D/3D Camera Fusion

A Framework for Multiple Radar and Multiple 2D/3D Camera Fusion A Framework for Multiple Radar and Multiple 2D/3D Camera Fusion Marek Schikora 1 and Benedikt Romba 2 1 FGAN-FKIE, Germany 2 Bonn University, Germany schikora@fgan.de, romba@uni-bonn.de Abstract: In this

More information

On-line and Off-line 3D Reconstruction for Crisis Management Applications

On-line and Off-line 3D Reconstruction for Crisis Management Applications On-line and Off-line 3D Reconstruction for Crisis Management Applications Geert De Cubber Royal Military Academy, Department of Mechanical Engineering (MSTA) Av. de la Renaissance 30, 1000 Brussels geert.de.cubber@rma.ac.be

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 14 130307 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Stereo Dense Motion Estimation Translational

More information

Continuous Multi-Views Tracking using Tensor Voting

Continuous Multi-Views Tracking using Tensor Voting Continuous Multi-Views racking using ensor Voting Jinman Kang, Isaac Cohen and Gerard Medioni Institute for Robotics and Intelligent Systems University of Southern California Los Angeles, CA 90089-073.

More information

Shape and Motion Carving in 6D

Shape and Motion Carving in 6D To Appear in the Proceedings of the 000 IEEE Conference oncomputer Visionand Pattern Recognition Shape and Motion Carving in 6D Sundar Vedula, Simon Baker, Steven Seitz, and Takeo Kanade The Robotics Institute,

More information

Video Alignment. Final Report. Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin

Video Alignment. Final Report. Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin Final Report Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin Omer Shakil Abstract This report describes a method to align two videos.

More information

Optical flow and tracking

Optical flow and tracking EECS 442 Computer vision Optical flow and tracking Intro Optical flow and feature tracking Lucas-Kanade algorithm Motion segmentation Segments of this lectures are courtesy of Profs S. Lazebnik S. Seitz,

More information

The SIFT (Scale Invariant Feature

The SIFT (Scale Invariant Feature The SIFT (Scale Invariant Feature Transform) Detector and Descriptor developed by David Lowe University of British Columbia Initial paper ICCV 1999 Newer journal paper IJCV 2004 Review: Matt Brown s Canonical

More information

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Structured Light II Johannes Köhler Johannes.koehler@dfki.de Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Introduction Previous lecture: Structured Light I Active Scanning Camera/emitter

More information

A Background Modeling Approach Based on Visual Background Extractor Taotao Liu1, a, Lin Qi2, b and Guichi Liu2, c

A Background Modeling Approach Based on Visual Background Extractor Taotao Liu1, a, Lin Qi2, b and Guichi Liu2, c 4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering (ICMMCCE 2015) A Background Modeling Approach Based on Visual Background Extractor Taotao Liu1, a, Lin Qi2, b

More information

Human Motion Detection and Tracking for Video Surveillance

Human Motion Detection and Tracking for Video Surveillance Human Motion Detection and Tracking for Video Surveillance Prithviraj Banerjee and Somnath Sengupta Department of Electronics and Electrical Communication Engineering Indian Institute of Technology, Kharagpur,

More information

Stereo Vision. MAN-522 Computer Vision

Stereo Vision. MAN-522 Computer Vision Stereo Vision MAN-522 Computer Vision What is the goal of stereo vision? The recovery of the 3D structure of a scene using two or more images of the 3D scene, each acquired from a different viewpoint in

More information

COMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION

COMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION COMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION Mr.V.SRINIVASA RAO 1 Prof.A.SATYA KALYAN 2 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING PRASAD V POTLURI SIDDHARTHA

More information

Volumetric stereo with silhouette and feature constraints

Volumetric stereo with silhouette and feature constraints Volumetric stereo with silhouette and feature constraints Jonathan Starck, Gregor Miller and Adrian Hilton Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, GU2 7XH, UK.

More information

Motion Estimation. There are three main types (or applications) of motion estimation:

Motion Estimation. There are three main types (or applications) of motion estimation: Members: D91922016 朱威達 R93922010 林聖凱 R93922044 謝俊瑋 Motion Estimation There are three main types (or applications) of motion estimation: Parametric motion (image alignment) The main idea of parametric motion

More information

Local invariant features

Local invariant features Local invariant features Tuesday, Oct 28 Kristen Grauman UT-Austin Today Some more Pset 2 results Pset 2 returned, pick up solutions Pset 3 is posted, due 11/11 Local invariant features Detection of interest

More information

Real-Time Tracking of Multiple People through Stereo Vision

Real-Time Tracking of Multiple People through Stereo Vision Proc. of IEE International Workshop on Intelligent Environments, 2005 Real-Time Tracking of Multiple People through Stereo Vision S. Bahadori, G. Grisetti, L. Iocchi, G.R. Leone, D. Nardi Dipartimento

More information

Finally: Motion and tracking. Motion 4/20/2011. CS 376 Lecture 24 Motion 1. Video. Uses of motion. Motion parallax. Motion field

Finally: Motion and tracking. Motion 4/20/2011. CS 376 Lecture 24 Motion 1. Video. Uses of motion. Motion parallax. Motion field Finally: Motion and tracking Tracking objects, video analysis, low level motion Motion Wed, April 20 Kristen Grauman UT-Austin Many slides adapted from S. Seitz, R. Szeliski, M. Pollefeys, and S. Lazebnik

More information

Occlusion Robust Multi-Camera Face Tracking

Occlusion Robust Multi-Camera Face Tracking Occlusion Robust Multi-Camera Face Tracking Josh Harguess, Changbo Hu, J. K. Aggarwal Computer & Vision Research Center / Department of ECE The University of Texas at Austin harguess@utexas.edu, changbo.hu@gmail.com,

More information

EECS 442 Computer vision. Stereo systems. Stereo vision Rectification Correspondence problem Active stereo vision systems

EECS 442 Computer vision. Stereo systems. Stereo vision Rectification Correspondence problem Active stereo vision systems EECS 442 Computer vision Stereo systems Stereo vision Rectification Correspondence problem Active stereo vision systems Reading: [HZ] Chapter: 11 [FP] Chapter: 11 Stereo vision P p p O 1 O 2 Goal: estimate

More information

A Statistical Consistency Check for the Space Carving Algorithm.

A Statistical Consistency Check for the Space Carving Algorithm. A Statistical Consistency Check for the Space Carving Algorithm. A. Broadhurst and R. Cipolla Dept. of Engineering, Univ. of Cambridge, Cambridge, CB2 1PZ aeb29 cipolla @eng.cam.ac.uk Abstract This paper

More information

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University.

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University. 3D Computer Vision Structured Light II Prof. Didier Stricker Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de 1 Introduction

More information

An Approach for Real Time Moving Object Extraction based on Edge Region Determination

An Approach for Real Time Moving Object Extraction based on Edge Region Determination An Approach for Real Time Moving Object Extraction based on Edge Region Determination Sabrina Hoque Tuli Department of Computer Science and Engineering, Chittagong University of Engineering and Technology,

More information

BSB663 Image Processing Pinar Duygulu. Slides are adapted from Selim Aksoy

BSB663 Image Processing Pinar Duygulu. Slides are adapted from Selim Aksoy BSB663 Image Processing Pinar Duygulu Slides are adapted from Selim Aksoy Image matching Image matching is a fundamental aspect of many problems in computer vision. Object or scene recognition Solving

More information

COMPUTER VISION > OPTICAL FLOW UTRECHT UNIVERSITY RONALD POPPE

COMPUTER VISION > OPTICAL FLOW UTRECHT UNIVERSITY RONALD POPPE COMPUTER VISION 2017-2018 > OPTICAL FLOW UTRECHT UNIVERSITY RONALD POPPE OUTLINE Optical flow Lucas-Kanade Horn-Schunck Applications of optical flow Optical flow tracking Histograms of oriented flow Assignment

More information

Comment on Numerical shape from shading and occluding boundaries

Comment on Numerical shape from shading and occluding boundaries Artificial Intelligence 59 (1993) 89-94 Elsevier 89 ARTINT 1001 Comment on Numerical shape from shading and occluding boundaries K. Ikeuchi School of Compurer Science. Carnegie Mellon dniversity. Pirrsburgh.

More information

Computer Vision I - Filtering and Feature detection

Computer Vision I - Filtering and Feature detection Computer Vision I - Filtering and Feature detection Carsten Rother 30/10/2015 Computer Vision I: Basics of Image Processing Roadmap: Basics of Digital Image Processing Computer Vision I: Basics of Image

More information

Estimation of common groundplane based on co-motion statistics

Estimation of common groundplane based on co-motion statistics Estimation of common groundplane based on co-motion statistics Zoltan Szlavik, Laszlo Havasi 2, Tamas Sziranyi Analogical and Neural Computing Laboratory, Computer and Automation Research Institute of

More information

Using Geometric Blur for Point Correspondence

Using Geometric Blur for Point Correspondence 1 Using Geometric Blur for Point Correspondence Nisarg Vyas Electrical and Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA Abstract In computer vision applications, point correspondence

More information

BIL Computer Vision Apr 16, 2014

BIL Computer Vision Apr 16, 2014 BIL 719 - Computer Vision Apr 16, 2014 Binocular Stereo (cont d.), Structure from Motion Aykut Erdem Dept. of Computer Engineering Hacettepe University Slide credit: S. Lazebnik Basic stereo matching algorithm

More information

A Feature Point Matching Based Approach for Video Objects Segmentation

A Feature Point Matching Based Approach for Video Objects Segmentation A Feature Point Matching Based Approach for Video Objects Segmentation Yan Zhang, Zhong Zhou, Wei Wu State Key Laboratory of Virtual Reality Technology and Systems, Beijing, P.R. China School of Computer

More information

Local features and image matching. Prof. Xin Yang HUST

Local features and image matching. Prof. Xin Yang HUST Local features and image matching Prof. Xin Yang HUST Last time RANSAC for robust geometric transformation estimation Translation, Affine, Homography Image warping Given a 2D transformation T and a source

More information

Multi-Camera Calibration, Object Tracking and Query Generation

Multi-Camera Calibration, Object Tracking and Query Generation MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Multi-Camera Calibration, Object Tracking and Query Generation Porikli, F.; Divakaran, A. TR2003-100 August 2003 Abstract An automatic object

More information

Perceptual Grouping from Motion Cues Using Tensor Voting

Perceptual Grouping from Motion Cues Using Tensor Voting Perceptual Grouping from Motion Cues Using Tensor Voting 1. Research Team Project Leader: Graduate Students: Prof. Gérard Medioni, Computer Science Mircea Nicolescu, Changki Min 2. Statement of Project

More information

Lecture 6 Stereo Systems Multi-view geometry

Lecture 6 Stereo Systems Multi-view geometry Lecture 6 Stereo Systems Multi-view geometry Professor Silvio Savarese Computational Vision and Geometry Lab Silvio Savarese Lecture 6-5-Feb-4 Lecture 6 Stereo Systems Multi-view geometry Stereo systems

More information

Lecture 6 Stereo Systems Multi- view geometry Professor Silvio Savarese Computational Vision and Geometry Lab Silvio Savarese Lecture 6-24-Jan-15

Lecture 6 Stereo Systems Multi- view geometry Professor Silvio Savarese Computational Vision and Geometry Lab Silvio Savarese Lecture 6-24-Jan-15 Lecture 6 Stereo Systems Multi- view geometry Professor Silvio Savarese Computational Vision and Geometry Lab Silvio Savarese Lecture 6-24-Jan-15 Lecture 6 Stereo Systems Multi- view geometry Stereo systems

More information

Robust and accurate change detection under sudden illumination variations

Robust and accurate change detection under sudden illumination variations Robust and accurate change detection under sudden illumination variations Luigi Di Stefano Federico Tombari Stefano Mattoccia Errico De Lisi Department of Electronics Computer Science and Systems (DEIS)

More information

Prof. Trevor Darrell Lecture 18: Multiview and Photometric Stereo

Prof. Trevor Darrell Lecture 18: Multiview and Photometric Stereo C280, Computer Vision Prof. Trevor Darrell trevor@eecs.berkeley.edu Lecture 18: Multiview and Photometric Stereo Today Multiview stereo revisited Shape from large image collections Voxel Coloring Digital

More information

NIH Public Access Author Manuscript Proc Int Conf Image Proc. Author manuscript; available in PMC 2013 May 03.

NIH Public Access Author Manuscript Proc Int Conf Image Proc. Author manuscript; available in PMC 2013 May 03. NIH Public Access Author Manuscript Published in final edited form as: Proc Int Conf Image Proc. 2008 ; : 241 244. doi:10.1109/icip.2008.4711736. TRACKING THROUGH CHANGES IN SCALE Shawn Lankton 1, James

More information

IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 6, NO. 5, SEPTEMBER

IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 6, NO. 5, SEPTEMBER IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 6, NO. 5, SEPTEMBER 2012 411 Consistent Stereo-Assisted Absolute Phase Unwrapping Methods for Structured Light Systems Ricardo R. Garcia, Student

More information

Project Updates Short lecture Volumetric Modeling +2 papers

Project Updates Short lecture Volumetric Modeling +2 papers Volumetric Modeling Schedule (tentative) Feb 20 Feb 27 Mar 5 Introduction Lecture: Geometry, Camera Model, Calibration Lecture: Features, Tracking/Matching Mar 12 Mar 19 Mar 26 Apr 2 Apr 9 Apr 16 Apr 23

More information

Iterative Estimation of 3D Transformations for Object Alignment

Iterative Estimation of 3D Transformations for Object Alignment Iterative Estimation of 3D Transformations for Object Alignment Tao Wang and Anup Basu Department of Computing Science, Univ. of Alberta, Edmonton, AB T6G 2E8, Canada Abstract. An Iterative Estimation

More information

Real-time Detection of Illegally Parked Vehicles Using 1-D Transformation

Real-time Detection of Illegally Parked Vehicles Using 1-D Transformation Real-time Detection of Illegally Parked Vehicles Using 1-D Transformation Jong Taek Lee, M. S. Ryoo, Matthew Riley, and J. K. Aggarwal Computer & Vision Research Center Dept. of Electrical & Computer Engineering,

More information

Overview. Related Work Tensor Voting in 2-D Tensor Voting in 3-D Tensor Voting in N-D Application to Vision Problems Stereo Visual Motion

Overview. Related Work Tensor Voting in 2-D Tensor Voting in 3-D Tensor Voting in N-D Application to Vision Problems Stereo Visual Motion Overview Related Work Tensor Voting in 2-D Tensor Voting in 3-D Tensor Voting in N-D Application to Vision Problems Stereo Visual Motion Binary-Space-Partitioned Images 3-D Surface Extraction from Medical

More information

Depth. Common Classification Tasks. Example: AlexNet. Another Example: Inception. Another Example: Inception. Depth

Depth. Common Classification Tasks. Example: AlexNet. Another Example: Inception. Another Example: Inception. Depth Common Classification Tasks Recognition of individual objects/faces Analyze object-specific features (e.g., key points) Train with images from different viewing angles Recognition of object classes Analyze

More information

Local features: detection and description May 12 th, 2015

Local features: detection and description May 12 th, 2015 Local features: detection and description May 12 th, 2015 Yong Jae Lee UC Davis Announcements PS1 grades up on SmartSite PS1 stats: Mean: 83.26 Standard Dev: 28.51 PS2 deadline extended to Saturday, 11:59

More information

Feature Tracking and Optical Flow

Feature Tracking and Optical Flow Feature Tracking and Optical Flow Prof. D. Stricker Doz. G. Bleser Many slides adapted from James Hays, Derek Hoeim, Lana Lazebnik, Silvio Saverse, who in turn adapted slides from Steve Seitz, Rick Szeliski,

More information

Introduction à la vision artificielle X

Introduction à la vision artificielle X Introduction à la vision artificielle X Jean Ponce Email: ponce@di.ens.fr Web: http://www.di.ens.fr/~ponce Planches après les cours sur : http://www.di.ens.fr/~ponce/introvis/lect10.pptx http://www.di.ens.fr/~ponce/introvis/lect10.pdf

More information

Augmented Reality VU. Computer Vision 3D Registration (2) Prof. Vincent Lepetit

Augmented Reality VU. Computer Vision 3D Registration (2) Prof. Vincent Lepetit Augmented Reality VU Computer Vision 3D Registration (2) Prof. Vincent Lepetit Feature Point-Based 3D Tracking Feature Points for 3D Tracking Much less ambiguous than edges; Point-to-point reprojection

More information

Measurement of Pedestrian Groups Using Subtraction Stereo

Measurement of Pedestrian Groups Using Subtraction Stereo Measurement of Pedestrian Groups Using Subtraction Stereo Kenji Terabayashi, Yuki Hashimoto, and Kazunori Umeda Chuo University / CREST, JST, 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan terabayashi@mech.chuo-u.ac.jp

More information

Stereo Matching.

Stereo Matching. Stereo Matching Stereo Vision [1] Reduction of Searching by Epipolar Constraint [1] Photometric Constraint [1] Same world point has same intensity in both images. True for Lambertian surfaces A Lambertian

More information

Lecture 19: Motion. Effect of window size 11/20/2007. Sources of error in correspondences. Review Problem set 3. Tuesday, Nov 20

Lecture 19: Motion. Effect of window size 11/20/2007. Sources of error in correspondences. Review Problem set 3. Tuesday, Nov 20 Lecture 19: Motion Review Problem set 3 Dense stereo matching Sparse stereo matching Indexing scenes Tuesda, Nov 0 Effect of window size W = 3 W = 0 Want window large enough to have sufficient intensit

More information

Local Feature Detectors

Local Feature Detectors Local Feature Detectors Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr Slides adapted from Cordelia Schmid and David Lowe, CVPR 2003 Tutorial, Matthew Brown,

More information

Car tracking in tunnels

Car tracking in tunnels Czech Pattern Recognition Workshop 2000, Tomáš Svoboda (Ed.) Peršlák, Czech Republic, February 2 4, 2000 Czech Pattern Recognition Society Car tracking in tunnels Roman Pflugfelder and Horst Bischof Pattern

More information

Multi-Person Tracking-by-Detection based on Calibrated Multi-Camera Systems

Multi-Person Tracking-by-Detection based on Calibrated Multi-Camera Systems Multi-Person Tracking-by-Detection based on Calibrated Multi-Camera Systems Xiaoyan Jiang, Erik Rodner, and Joachim Denzler Computer Vision Group Jena Friedrich Schiller University of Jena {xiaoyan.jiang,erik.rodner,joachim.denzler}@uni-jena.de

More information

Kinecting the dots: Particle Based Scene Flow From Depth Sensors

Kinecting the dots: Particle Based Scene Flow From Depth Sensors Kinecting the dots: Particle Based Scene Flow From Depth Sensors Simon Hadfield Richard Bowden Centre for Vision, Speech and Signal Processing University of Surrey, Guildford, Surrey, UK, GU 7XH {s.hadfield,r.bowden}@surrey.ac.uk

More information

Fundamental Matrices from Moving Objects Using Line Motion Barcodes

Fundamental Matrices from Moving Objects Using Line Motion Barcodes Fundamental Matrices from Moving Objects Using Line Motion Barcodes Yoni Kasten (B), Gil Ben-Artzi, Shmuel Peleg, and Michael Werman School of Computer Science and Engineering, The Hebrew University of

More information

Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation

Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation ÖGAI Journal 24/1 11 Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation Michael Bleyer, Margrit Gelautz, Christoph Rhemann Vienna University of Technology

More information

Object and Motion Recognition using Plane Plus Parallax Displacement of Conics

Object and Motion Recognition using Plane Plus Parallax Displacement of Conics Object and Motion Recognition using Plane Plus Parallax Displacement of Conics Douglas R. Heisterkamp University of South Alabama Mobile, AL 6688-0002, USA dheister@jaguar1.usouthal.edu Prabir Bhattacharya

More information