A multi-camera positioning system for steering of a THz stand-off scanner


Maria Axelsson, Mikael Karlsson and Staffan Rudner
Swedish Defence Research Agency, Box 1165, SE-581 11 Linköping, SWEDEN

ABSTRACT

Stand-off THz imaging to detect concealed threats is an emerging technique for security applications. A THz sensor can provide high resolution 3D imagery of a scene. However, efficient scene scanning and management of the THz sensor is a challenging task due to the limited field of view of the sensor and physical scanning limitations. In this paper we discuss the requirements on a scene scanning solution and present a scene scanning technique using a multi-camera system with 3D positioning capabilities. A visual hull method is used to position subjects in the scene. The presented technique relaxes the requirements on the scanning speed of the THz sensor and facilitates an efficient scene scanning solution.

Keywords: Imaging, screening, stand-off detection, terahertz, visual hull

1. INTRODUCTION

Detection of concealed threats at stand-off distance is desired in many security applications. Upcoming techniques for stand-off detection use sub-millimeter-wave imaging systems which can provide high resolution 3D imagery of a scene. In such 3D imagery, hidden threats can be detected using manual or automated methods, and security staff can be alerted before the threat is close. The THz sensor systems developed for stand-off detection usually have a limited spatial coverage due to the instantaneous field of view, the scanning rate, and the manageable data rate, which puts a demand on methods for efficient scene scanning. A scene scanning system is needed to control the THz sensor and steer it to the 3D position where a subject or a specific part of a subject is located. Valuable imaging time can be saved using accurate 3D positioning of each subject to avoid scanning empty areas.
Subjects can be tracked and positioned in the scene, and at each time step the sensor can be steered to the next point of interest. In complex scenarios where subjects are allowed to walk or move freely, the system can also be used to track individual body parts and obtain full coverage of the body incrementally, as different body parts of a person become visible to the THz sensor system. In addition to positioning, a scene scanning system can also provide an estimate of scan completeness, or information from the scanning system can be used to merge the high resolution volume data from different scans, e.g., frontal and rear scans of a person, to verify that full coverage of the body is obtained.

In this paper we discuss scene scanning methods using multi-sensor approaches. We also present our experimental results from an investigation of a scene scanning technique using a multi-camera system with 3D positioning capabilities. Multi-sensor positioning systems are investigated since their ability to detect and position subjects accurately in 3D is greatly improved compared to using only a single sensor. The 3D positioning is demonstrated on real data acquired from seven HD video cameras. We use a visual hull method where each camera view provides support for the presence of interesting foreground objects in each part of the scene, based on an adaptive Gaussian background model. The presented technique relaxes the requirements on the scanning speed of the THz sensor and facilitates an efficient scene scanning solution.

Correspondence: maria.axelsson@foi.se

Passive Millimeter-Wave Imaging Technology XIV, edited by David A. Wikner, Arttu R. Luukanen, Proc. of SPIE Vol. 8022, 80220L. © 2011 SPIE. CCC code: 0277-786X/11/$18. doi: 10.1117/12.883394

Figure 1. Example scene where a single subject is scanned from three sides while walking along an appointed path.

2. SCANNING SCENARIOS

Several scenarios can be imagined at a security checkpoint. The problem complexity increases with the number of subjects in the scene and with their allowed variation in pose and position. The simplest case is where a single subject stands in an appointed pose while the THz system scans one side. The subject then turns and the other side is scanned. As the pose and position are roughly known, the scene scanning system has much a priori information. If the high resolution sensor is slow in image acquisition compared to the time it is possible to stand still, it becomes necessary to track body parts and keep track of the scanned volumes.

The problem complexity increases when the subject is allowed to move more freely. Figure 1 shows an example scene setup for a case where the subject is moving along a path and scans can be obtained from three different views during motion. The position and pose of a subject walking through the security checkpoint need to be tracked to ensure that all parts of the body are scanned. Depending on the time constraints set by the speed of the THz scanning system, both the position and orientation of the person and the positions and orientations of the arms and legs might be needed.

In an environment where one person at a time passes through the scanning area, there is no risk of inter-person occlusions. Additionally, the scanning system does not need to shift its focus between multiple persons. This relaxes the requirement on fast sensor movements. It is desirable to maximize the flow through a security checkpoint, and hence unnecessary restrictions should be avoided. If multiple subjects are allowed to walk along the designated path simultaneously, there is of course an increase in complexity, and such scenarios are further restricted by the image acquisition time.
In an environment where subjects can move entirely freely, occlusions are likely to preclude complete scanning of all persons. The highest complexity level is scanning a scene with a free flowing crowd. This is far beyond the achievable horizon at this time.

3. CAMERA CONFIGURATION FOR A SCENE SCANNING SYSTEM

Information about the position and pose of all subjects in a scene can be obtained using many different sensor configurations. We have briefly looked into different sensors, e.g., visual cameras, infrared cameras, and range sensors. However, as many image processing methods are already available for visual cameras, we decided to start our experiments with a multi-camera setup using visual cameras. Infrared cameras can also be added to aid detection of subjects.

A camera configuration should contain multiple cameras to facilitate positioning of the individuals that need to be screened. When multiple cameras are used, positioning and tracking performance is improved compared to using a single sensor. Both occlusions between subjects and subject self-occlusion can be handled better with a multi-camera setup. With multiple cameras, distance measurements can be made using triangulation, dense stereo maps can be calculated from pairs of cameras, making it easier to identify individuals who are partly occluded, and 3D positioning methods such as the visual hull can be used. The visual hull method is further described in Section 4.

The camera positions must be known to be able to calculate stereo maps and triangulate distances to objects in the world. This requires that both the intrinsic and extrinsic camera parameters are known. An example of practical camera calibration is described in Section 5. The cameras should be set up in a configuration around the scene where they cover a common field of view. In our experiment, described in Section 5, we used seven cameras in a circle to be able to investigate methods for pose estimation and positioning using a multiple-view setup. However, in a final scanning system, fewer cameras in a half circle, where one is positioned close to the high resolution sensor, may be enough.
This will be determined by the demands from the application on the scene scanning and the requirements on the tracking and positioning algorithms.

4. IMAGE PROCESSING METHODS

Image processing methods can be used for 3D positioning, shape extraction, and pose estimation using the camera data. In addition, the data can be used to extract 3D models of each subject to be matched with the high resolution THz data to gain information about the completeness of the scans. Some of the available image processing methods are described in the following paragraphs.

First, detection of individuals in the scene is needed. This can be achieved by using a background model and extracting the foreground in each frame. Then potential targets can be detected using a tailored detector, e.g., a head detector. Accurate and consistent detections in complex scenes with many subjects or many moving objects are usually difficult to obtain. Therefore, detections are often not used by themselves if a robust positioning method is needed. Instead, detections can be passed on to a target tracker which uses a model of the possible movements to select feasible tracks. If the scanning system is fast and can capture the interesting part of a body before the subject has moved too far, this sort of tracking together with a silhouette image from the position of the THz sensor may be enough in a scene scanning system.

If more precise 3D positioning is needed, full body volume detection and tracking of body parts can be used. An example method for 3D positioning and estimation of shape is extraction of the visual hull. Here too a background model is used: the foreground objects, such as individuals, are segmented in each view, and the segmentations from all camera views are used to reconstruct the 3D bounding volume of the foreground objects. This is in some sense similar to tomographic reconstruction, except that we only have binary images with object and background classes.
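The per-pixel background modelling described above can be sketched as follows. This is a minimal single-Gaussian illustration, not the implementation used in the paper; the adaptation rate `alpha` and the sensitivity parameter `k` are hypothetical values chosen for the example.

```python
import numpy as np

def update_background(mean, var, frame, alpha=0.02):
    """Running per-pixel Gaussian background model.

    mean, var : per-pixel background mean and variance (float arrays)
    frame     : current grayscale frame
    alpha     : adaptation rate for gradual scene changes
    """
    mean = (1 - alpha) * mean + alpha * frame
    var = (1 - alpha) * var + alpha * (frame - mean) ** 2
    return mean, np.maximum(var, 1e-6)  # floor the variance to avoid division by zero

def foreground_probability(mean, var, frame, k=2.5):
    """Soft foreground mask: squared distance from the background model,
    in units of the per-pixel standard deviation, squashed to [0, 1)."""
    d2 = (frame - mean) ** 2 / var
    return 1.0 - np.exp(-d2 / (2 * k ** 2))
```

Pixels close to the background model get a probability near zero, while large deviations saturate towards one; the adaptive update lets the model follow gradual illumination changes.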
The visual hull was introduced by Laurentini [1] as the volume which completely encloses an object in a scene given a set of silhouette images. It is widely used to produce three-dimensional models from multiple views. In multi-person scenarios, persons often occlude each other in some camera views, and therefore this type of method often needs many views, which may be impractical. To relax the requirements on the segmentation of the subjects against the background, range sensors can be used to aid the extraction of the visual hull.

In addition to positioning subjects, pose estimation is important if subjects are allowed to move in the scene or if the scanning is very slow and subjects cannot stand still. In such scenarios, data from the high resolution imagery must be registered to the correct body parts on a model to identify regions that are not yet scanned. Pose estimation can, e.g., be performed by extracting silhouettes from all camera images and matching them to possible poses, selecting the pose which best matches the silhouettes. Pose can also be obtained using dense stereo maps or range sensors [2].

Figure 2. The actors in the field experiment showing the variation of clothes.

5. EXPERIMENTAL SETUP

A field trial was carried out in the summer of 2010. The aim of the trial was to collect data sets to assess scene scanning methods and evaluate specific image processing methods. In the experiment, seven Panasonic HDC SD-700 camcorders recording at 50 fps were used. A fictive circular checkpoint approximately 12 meters in diameter was built. Cameras and lighting were rigged in a circle, with the cameras mounted 2.5-3.0 meters above the ground facing a common scene center.

Measurements were collected during two days. On the first day, our three main experiments were carried out, where four actors walked along three pre-defined paths. The actors wore more or less concealing clothes to get data suitable for evaluating the difficulty of pose estimation. Figure 2 shows the actors and three sets of clothes. On the second day, a smaller experiment was carried out where four different actors in regular office clothes stood in different simple poses, to be used for initial experiments with pose tracking. Camera calibration scenes were also recorded on both days.

First the video sequences were time synchronized. An example frame is shown in Figure 3. The synchronization was performed by detecting a specific event in all views, such as a clap or a flashing light. The frame rate is high, which makes accurate time synchronization easier.
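The event-based synchronization can be illustrated with a simple sketch: if a flash produces a sharp jump in brightness, the frame index of the largest jump in each sequence yields the relative offsets between cameras. Reducing each frame to a single mean-brightness value is an assumption made for this example, not the authors' exact procedure.

```python
import numpy as np

def flash_frame(brightness):
    """Index of the frame with the largest frame-to-frame brightness jump
    (assumed to be the synchronization flash)."""
    return int(np.argmax(np.diff(brightness))) + 1

def frame_offsets(sequences):
    """Per-camera frame offsets that align every sequence to the first one.

    sequences : list of per-frame mean-brightness traces, one per camera
    """
    events = [flash_frame(np.asarray(b, dtype=float)) for b in sequences]
    return [e - events[0] for e in events]
```

At 50 fps, locating the flash to the nearest frame aligns the cameras to within 20 ms, which is what makes the high frame rate helpful here.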
The intrinsic camera parameters (focal length, principal point, and lens distortion) and the extrinsic camera parameters (camera position and orientation) of the seven cameras must be determined to be able to obtain accurate measurements from the video sequences. The intrinsic parameters are calibrated for each camera individually using the camera calibration toolbox for Matlab [3]. A checkerboard pattern is moved around to cover the entire field of view of the camera. The corners on the checkerboard are found automatically [4] and extracted with sub-pixel precision. A wrapper is used for the main calibration function in the toolbox to automate the intrinsic camera calibration.

Figure 3. Seven cameras are synchronized and calibrated to obtain measurements of a common scene.

The extrinsic camera parameters were determined using reference markers on the ground. The coordinates of the markers on the ground, the coordinates of the markers in each camera view, and the intrinsic camera parameters are then used together to calculate the position and orientation of each camera. The current extrinsic calibration method has an accuracy within a couple of centimeters at a distance of five to six meters (in the scene center). The manual steps in the current calibration method can be reduced and the accuracy can be improved. We are currently investigating automated self-calibration of the rig to facilitate fast deployment of the scene scanning system [5].

6. RESULTS FROM THE EXPERIMENTS

The measurements from the field trial have been used to evaluate a method for scene scanning using extraction of the visual hull of the subjects in the scene. The visual hull is the 3D space which is bounded by the projections of the subject in the camera views. It can be refined when more cameras are added to the scene. The method we use to extract the visual hull is based on silhouette images of the subject. Each of the seven camera views in a frame is segmented into foreground and background using a Gaussian background model. All pixels in the image are given a probability of belonging to the foreground. The background model is also adaptive to gradual changes over time. An example image and the probability of each pixel belonging to the foreground are shown in Figure 4. The visual hull is extracted by projecting the foreground pixels in each camera to the common world coordinates in the scene.
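The projection of world coordinates through a calibrated camera, and the triangulation mentioned in Section 3, can be sketched with standard pinhole geometry (distortion omitted). The intrinsic matrix and poses below are invented for illustration; the paper's own calibration used the Matlab toolbox cited above.

```python
import numpy as np

def projection_matrix(K, R, t):
    """3x4 pinhole projection P = K [R | t]."""
    return K @ np.hstack([R, t.reshape(3, 1)])

def project(P, X):
    """Project a 3D world point to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate(P1, x1, P2, x2):
    """Linear (DLT) triangulation of one point from two calibrated views."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]              # null vector of A, homogeneous world point
    return X[:3] / X[3]
```

The same projection is what maps a ground marker or a world cell into a camera view; triangulating known ground markers back from two or more views is one way to check the extrinsic calibration accuracy quoted above.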
The world is divided into cells, and each cell gains support from each camera that indicates there is an object of interest occupying the cell. If many (often all) cameras agree that there is something in the cell, it is set as object. The visual hull extraction is calculated in levels from the ground and upwards, and the resolution of the cells can be set differently for the three dimensions. The resulting visual hull is represented as cells which are either set or not set. It is also possible to assign each cell a probability of containing an object. A visualization of the extracted visual hull is shown in Figure 5. As can be seen in the figure, shadows can affect the extracted visual hull. Pixels misclassified as foreground or background also affect the result.
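The cell-voting scheme above can be sketched as follows, assuming binary silhouette images and known 3x4 projection matrices per camera; the `min_votes` threshold is a stand-in for the "many (often all) cameras agree" criterion.

```python
import numpy as np

def visual_hull(cell_centers, silhouettes, projections, min_votes):
    """Occupancy of world cells from binary silhouette images.

    cell_centers : (N, 3) world coordinates of the cell centers
    silhouettes  : list of (H, W) boolean foreground masks, one per camera
    projections  : list of 3x4 camera projection matrices
    min_votes    : number of cameras that must agree before a cell is set
    """
    n = len(cell_centers)
    votes = np.zeros(n, dtype=int)
    Xh = np.hstack([cell_centers, np.ones((n, 1))])  # homogeneous coordinates
    for sil, P in zip(silhouettes, projections):
        x = P @ Xh.T                                 # 3xN homogeneous pixels
        u = np.round(x[0] / x[2]).astype(int)
        v = np.round(x[1] / x[2]).astype(int)
        h, w = sil.shape
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (x[2] > 0)
        hit = np.zeros(n, dtype=bool)
        hit[inside] = sil[v[inside], u[inside]]      # cell projects onto foreground
        votes += hit
    return votes >= min_votes
```

Replacing the binary masks with the per-pixel foreground probabilities would instead accumulate a probability per cell, as mentioned in the text.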

Figure 4. (Left) Image frame from one of the cameras. (Right) Foreground segmentation using the Gaussian background model.

Figure 5. (Left) Visualization of the extracted visual hull and its position on the floor in 3D. (Right) A part of a camera frame used in the reconstruction of the visual hull.

These effects can be reduced using an improved model for background extraction. Our example here is used for demonstration purposes, and the segmentation parameters have not been tuned to our specific conditions at this time. Further visualization examples of the visual hull are shown in Figure 6. All examples of the visual hull shown here are extracted using a cell size of 0.01 meters. Such a small cell size may not be required for an efficient scene scanning method; however, this needs further investigation.

A method for scene scanning using the visual hull may be used in combination with target tracking to find the interesting part of the scene where accurate 3D positioning is needed. This can reduce the processing time required for the scene scanning. The time requirements on the scene scanning system to output new scanning positions will depend on the scanning speed of the THz system. If the scanning is fast, the requirement on tracking body parts is somewhat relaxed, and if the scanning is slow, there will be more time left to process the images and track subjects and their body parts.

Figure 6. The extracted visual hull of a subject from four different views.

7. CONCLUSION AND FUTURE WORK

In this paper we have discussed some of the available techniques for scene scanning. With the data from the field trial we have demonstrated a method for 3D positioning of subjects in a scene using a visual hull method. This type of method is feasible for scenes with few, well separated individuals, since occlusion between subjects may create ghost detections. If the scanning using the THz system is rapid, scenarios like a single subject standing in any pose may be solvable in combination with a merging of data from, e.g., the front and back side of the subject. To handle more complex scenes, e.g., where a single subject moves along an appointed path, the visual hull method needs to be extended with tracking functionality. Other methods, such as pose estimation from silhouettes, may be possible to use directly without extracting the full visual hull as an intermediate step. Pose estimation can also be used to assess scan completeness. Further investigations are needed to evaluate the methods suggested for scene scanning with respect to performance and speed.

ACKNOWLEDGMENTS

MSB (Swedish Civil Contingencies Agency) and FMV (Swedish Defence Materiel Administration) are acknowledged for funding this study.

REFERENCES

[1] A. Laurentini, "The visual hull concept for silhouette-based image understanding," IEEE Trans. Pattern Anal. Mach. Intell. 16, pp. 150-162, February 1994.
[2] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," in IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[3] J.-Y. Bouguet, "Camera calibration toolbox for Matlab," http://www.vision.caltech.edu/bouguetj/calib_doc/.
[4] M. Axelsson, P. Follo, and C. Grönwall, "Camera calibration using automated identification of checkerboard patterns," in Proceedings of SSBA 2010, Symposium on Image Analysis, Uppsala, 2010.
[5] M. Axelsson, "Automatic calibration of a camera positioning system using estimation of the essential matrix," in Proceedings of SSBA 2011, Symposium on Image Analysis, Linköping, 2011.