A Visible-Light and Infrared Video Database for Performance Evaluation of Video/Image Fusion Methods


Andreas Ellmauthaler · Carla L. Pagliari · Eduardo A. B. da Silva · Jonathan N. Gois · Sergio R. Neves

Received: date / Accepted: date

Abstract In general, the fusion of visible-light and infrared images produces a composite representation where both data are pictured in a single image. The successful development of image/video fusion algorithms relies on realistic infrared/visible-light datasets. To the best of our knowledge, there is a particular shortage of databases with registered and synchronized videos from the infrared and visible-light spectra suitable for image/video fusion research. To address this need we recorded an image/video fusion database using infrared and visible-light cameras under varying illumination conditions. Moreover, different scenarios have been defined to better challenge the fusion methods, with various contexts and contents providing a wide variety of meaningful data for fusion purposes, including non-planar scenes, where objects appear on different depth planes. However, there are several difficulties in creating datasets for research in infrared/visible-light image fusion. Camera calibration, registration, and synchronization can be listed as important steps of this task. In particular, image registration between imagery from sensors of different spectral bands imposes additional difficulties, as it is very challenging to solve the correspondence problem between such images. Motivated by these challenges, this work introduces a novel spatiotemporal video registration method capable of generating registered and temporally aligned infrared/visible-light video sequences. The proposed workflow improves the registration accuracy when compared to the state of the art.

A. Ellmauthaler, Halliburton Technology Center, Rua Paulo Emidio Barbosa, Ilha da Cidade Universitaria, Rio de Janeiro, Brazil. andreas.ellmauthaler@halliburton.com
C. L. Pagliari, Instituto Militar de Engenharia, Rio de Janeiro, Brazil. carla@ime.eb.br
E. A. B. da Silva, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil. eduardo@smt.ufrj.br
J. N. Gois, Centro Federal de Educação Tecnológica Celso Suckow da Fonseca, Rio de Janeiro, Brazil. jonathan.gois@cefet-rj.br
S. R. Neves, Instituto de Pesquisas da Marinha, Rio de Janeiro, Brazil. sergio@ipqm.mar.mil.br

By applying the proposed methodology to the recorded database we have generated the VLIRVDIF (Visible-Light and Infrared Video Database for Image Fusion), a publicly available database to be used by the research community to test and benchmark fusion schemes.

Keywords Infrared/visible image/video database · image registration · image fusion · camera calibration

Introduction

Multi-camera setups are particularly effective in environments where a single camera is incapable of capturing the entire information available within the monitored scene. Scenes shot with visible-light cameras usually exhibit good textural information, whereas infrared (IR) imagery may exhibit objects that are invisible to visible-light sensors, since IR sensors measure heat emissions to form an image. An imaging system could produce two distinct images/videos of the same scene or a single image/video containing information from various cameras positioned close to each other. The latter is of special interest since it exceeds the physical limitations of a single sensor within a single image. In particular, it is used in spatial and temporal super-resolution frameworks as well as in image fusion applications for the purpose of increasing the overall depth of focus, the overall dynamic range and the overall spectral response of an imaging system [].

In order to develop image fusion algorithms one needs a large and diverse set of image pairs. A variety of high quality data is useful to validate any method, testing its overall ability to provide results with real images, under real-life conditions. Datasets for image fusion research should contain video footage from different scenarios, designed to challenge fusion algorithms. We created the VLIRVDIF (Visible-Light and Infrared Video Database for Image Fusion) to contribute registered and synchronized videos from the IR and visible-light spectra, aiming to aid researchers working in image/video fusion. The database contains registered IR/visible-light video sequence pairs, as well as their raw, unprocessed counterparts, and is freely available for download at []. By doing so, we hope to help alleviate the problem faced by most researchers in multimodal image fusion, namely the shortage of registered and synchronized videos for evaluation purposes.

A major concern when creating the VLIRVDIF database was to generate a dataset for image/video fusion purposes under varying lighting conditions and challenging environmental conditions. Therefore, a number of the sequences were captured under high temperatures, between °C and °C, causing the vegetation and objects to be warmer than or as warm as the human subjects in the scenes. In addition, while some sequences were shot at night with little or no illumination, others were acquired under controlled indoor illumination. The scenes present objects concealed under clothing or behind other objects, which are only captured by the IR sensors. In addition, the database was designed to provide non-planar scenes, imposing challenging conditions on image fusion algorithms. When properly combined, visible-light and IR footage can provide videos that are useful for many applications. To the best of our knowledge, there are only a handful of publicly available databases which are suitable for research in multisensor image and video fusion.
The generation of such databases has to tackle several problems, including the creation of challenging scenarios and the inherent difficulties in producing registered source images at sub-pixel accuracy originating from image sensors operating at different spectral bands. This latter problem motivated us to develop an algorithm to register pairs of spatio-temporally misaligned IR and visible-light videos of the same dynamic scene recorded from distinct yet stationary viewpoints.

Figure  Schematic diagram of the proposed IR/visible-light video registration framework (temporal alignment, rectification and registration stages). In the superimposed pseudo-color images on the right, the visible-light and IR images occupy the green and red channels, respectively.

Therefore, independent of the underlying application, it is of vital importance that the images are represented in a common reference coordinate frame. This can be achieved by jointly calibrating the employed cameras, that is, computing their optical properties (intrinsic parameters) as well as the relative positions of the individual cameras with respect to each other (extrinsic parameters). Based on these calibration parameters the images can subsequently be undistorted and rectified such that the pixel coordinates in one image sequence are in direct correspondence to pixel coordinates in the other image sequence. In the course of this work we will refer to this process as image registration. Fig.  shows the schematic workflow of the developed IR/visible-light video registration framework.

In general, both traditional [ ] and self-calibration [, ] methods are well-suited for registering image sequences originating from cameras operating in the same spectral band. However, they tend to face problems for sequences obtained by sensors of different modalities (such as IR and visible-light sensors). For self-calibration methods this is mainly due to the possible lack of mutual feature points or common scene characteristics within corresponding input images. These problems are less severe for traditional calibration methods. However, the construction of a calibration board, whose interest points appear likewise in the IR and visible-light spectrum and allow for an accurate calibration of the employed cameras, is not a trivial task. As a consequence, only a few approaches to IR/visible-light stereo camera calibration can be found in the literature [,,,]. A more detailed discussion on camera calibration is presented in Section .

The developed approach uses a planar calibration board equipped with miniature light bulbs to register an IR/visible-light image sequence pair misaligned in space and time. The large number of light bulbs makes the registration process more robust against the lack of mutual scene characteristics, a common source of problems when registering video sequences originating from different spectral modalities. The processing chain first determines the exact light bulb positions in the individual frames of an IR/visible-light video sequence and utilizes this information to estimate the temporal offset. This is followed by the camera calibration process which is used to rectify and remove distortion from the images. We show that the developed system is able to estimate the temporal offset with a high confidence level. Furthermore, the introduced calibration scheme leads to calibration results which exhibit significantly smaller MREs (mean reprojection errors) when compared to the state-of-the-art. Examples of the effectiveness of the developed framework for generating pairs suitable for image fusion, where co-registered images at sub-pixel accuracy are required [], can be found in [].

The remaining part of this paper is organized as follows. Section  presents the databases available in the public domain that are suitable for image and video fusion research, discusses the requirements that guided the creation of the VLIRVDIF dataset, and describes its design. Next, Section  reviews the necessary camera calibration steps, while the temporal alignment, already presented in [], is reviewed in Section . The developed IR/visible-light camera registration scheme is described in detail in Section . In Section  the experimental results, obtained when applying the proposed framework in order to build the VLIRVDIF from misaligned/unregistered IR/visible-light video pairs, are presented. Conclusions and future plans are given in Section .

Image Fusion Database

Image fusion can be summarized as the process of integrating complementary information from multiple images into a composite representation containing a better description of the underlying scene than any of the individual source images could provide [0]. These techniques are particularly necessary in environments where a single sensor type is not sufficient to expose the whole scene content. For instance, images captured in the visible spectrum usually exhibit good textural information but tend not to contain objects located in poorly illuminated regions or behind smoke screens. IR imagery, on the other hand, does not suffer from these shortcomings but generally lacks textural information []. Thus, the combination of visible and IR sensors may lead to a composite representation where both textural information (visible image) and complementary information from the IR spectrum are depicted in a single image. The effectiveness of such fusion systems has been demonstrated for many tasks such as target tracking, concealed weapon detection, remote sensing and robot navigation, among others [].

An invaluable quality of a video surveillance system is to be effective both at day and night, as well as under different illumination conditions. As algorithms that rely on two different electromagnetic spectrum bands can be highly effective, we created a database that may also be suitable for video surveillance and military applications, besides object tracking and video understanding. In order to address these requirements, we defined the context of the video (e.g. indoor, outdoor, sunny day) and its content. The content is an object in the scene (e.g., a structure, a person, a vehicle) or an event (e.g., people interacting, people walking). From this a concept list was defined by integrating the items from the context and content sets. The idea was to create concepts for different applications presenting different scenarios of interest. These include indoor and outdoor surveillance sequences, people interacting, people leaving a package with a concealed weapon, people walking, people hiding in the woods, people hiding behind a smoke screen, vehicles and boats, all under varying illumination and temperature conditions. Subsection  surveys datasets that are available to the research community, and Subsection  describes the VLIRVDIF designed and generated in this work.

Public IR/visible databases

In [] the authors used a single-axis camera setup combining an IR and a visible-light camera to capture 00 image pairs of indoor and outdoor scenes. The cameras' alignment was achieved by a beam-splitter and the visible-light images had to be scaled down and further aligned

using a manually computed homography. No videos were produced and the unregistered image pairs are available at []. The test images used in [] were published in [] and []. The latter can be downloaded from []. When using this database, all image pairs have to be registered before the application of any fusion method.

The image data published in [] is not publicly available. It was created for human silhouette extraction purposes, where color and IR cameras recorded people walking in indoor environments.

The Object Tracking and Classification Beyond the Visible Spectrum (OTCBVS) Benchmark Dataset Collection [] provides benchmark datasets for testing and evaluating computer vision algorithms using images beyond the visible spectrum. These datasets comprise: person detection in thermal imagery (only IR images); unregistered thermal and visible face images under different conditions (unregistered IR/visible image pairs); color/thermal manually registered images; faces for facial analysis with thermal imagery; different scenarios with thermal imagery for detection and tracking purposes, weapon detection and weapon discharge detection with thermal imagery; and face images (visible-light only) illuminated by IR lights to overcome the problem of illumination variation for face recognition applications [].

The Ohio State University (OSU) Color and Thermal Database provides sequences with color/thermal images registered using homographies with manually selected points. This dataset, referenced in [], was created for the development of pedestrian tracking algorithms that require thermal and visible streams to be co-registered. It belongs to the OTCBVS collection [].

The Eden project multi-sensor dataset [], reported in [0], provides multi-sensor videos for target tracking purposes. The IR and visible-light videos were manually registered, with the correspondences being selected throughout each video sequence. Whenever possible, the videos from the two spectral bands were synchronized with the help of an omnidirectional flash, visible in both spectral bands, activated at each scene shot. Only one scenario is available at [] in visible, IR, side-by-side IR/visible and fused versions.

The International Society of Image Fusion [] provides links to websites containing databases of interest to information fusion researchers. However, there are neither registered nor synchronized IR/visible-light image sequences.

An imaging system for computing sparse depth maps from multispectral images is proposed in [] along with a visible-light and IR image dataset [] together with its associated ground truth. In [] the accuracy of similarity measures for thermal-visible image registration of human silhouettes is investigated using a registered and synchronized video dataset that is part of the LITIV dataset []. In the LITIV dataset humans are walking in a scene in various depth planes. Another part of the LITIV dataset was introduced in [], addressing the issues of IR/visible video registration, sensor fusion, and people tracking for far-range videos.

A dataset for maritime imagery recognition was proposed in []. It is publicly available at [] and comprises unregistered visible and infrared ship imagery.

The National Optics Institute (INO) Video Analytics Dataset [] comprises visible-light and IR registered sequence pairs, where sequences are available at 0 frames per second (fps) and one at  fps. The videos include parking lots, vehicles and people.
The Visual Analysis of People Lab created the Stereo Thermal Dataset [], containing video sequences from two synchronized thermal cameras, with 0 0 pixels at 0 fps, captured under sunny conditions with a temperature of approximately 0 °C. The videos exhibit pedestrians in scenarios that present a high degree of occlusion. The dataset has no visible-light spectrum data.

The VLIRVDIF (Visible-Light and Infrared Video Database for Image Fusion)

There are many challenges involved in jointly processing the IR and visible-light spectral bands. Therefore, when creating a database for IR/visible-light image/video fusion purposes one needs to take these into consideration by imposing several conditions on the scenes. Usually, low contrast levels in the visible-light spectrum severely affect the performance of image/video processing methods; the lower the contrast of a scene, the worse the performance of the image/video processing algorithm. Therefore, some scenarios should include different illumination conditions such as bright sunlight and nighttime footage, as well as illumination changes across the video. Additionally, smoke screens and objects moving into shaded areas could be introduced to stress object tracking in the visible band. Also, people wearing camouflage uniforms hiding behind vegetation increase the difficulty on the visible-light side. Challenges for the IR sensor are imposed by high outdoor temperatures (very close to the average human body temperature) under bright sunlight and by controlled indoor illumination variation. Moreover, people transporting objects that are concealed from the visible-band sensor (e.g. behind newspapers and vegetation or inside bags) should also be part of the content. Therefore, both the video context (e.g. indoor, outdoor, sunny day) and the content (e.g. structures, persons, hidden objects) offer challenging scenarios for testing video registration and fusion algorithms.

Taking these into account, we designed the VLIRVDIF, consisting of different video sequences, manually recorded at distinct locations. Table  gives an overview of the main properties of the recorded video sequences, including a rough summary of the scene contents as well as the prevailing environmental conditions. Selected scene thumbnails can be seen in Fig. .

The four outdoor sets of video sequences were shot under different conditions. The seven Camouflage sequences were acquired under bright sunlight and high temperatures. The sequences exhibit people wearing both civilian and camouflage outfits passing through and/or hiding behind vegetation. In addition, some sequences present smoke screens transparent to the infrared wavelengths. In general, smoke screens are used to obscure visible and infrared radiation as electro-optical countermeasures. The smoke screen present in some Camouflage sequences conceals objects in the visible region only. In some Camouflage sequences there are people carrying weapons who are hiding behind vegetation and/or the smoke screen. The Patio sequence, shot at twilight, displays one person concealed in the vegetation with a background consisting of several people passing by a corridor. A particular feature of the Camouflage and Patio sequences is given by their scene planes. The Camouflage sequence exhibits two dominant scene planes at distances of 0m and 00m, respectively, whereas the Patio sequence features an inclined scene plane with distances varying in the range of -0m. This poses a significant challenge to image fusion algorithms. The outdoor sequence Trees comprises four different scenes, all shot under bright sunlight, with some scenes acquired under backlighting conditions. One of the Trees scenes displays a person concealed under the shade of trees, while the others show two people crossing a lawn, either hiding in a shaded area or emerging from a shaded area.
There is also a scene with a car that stops and picks up one of the hidden persons. The nighttime sequences Guanabara Bay disclose a view of Guanabara Bay and the Rio de Janeiro-Niterói bridge, with vehicles crossing and vessels navigating the bay.

The two sets of indoor video sequences, Lab and Hangar, were shot under artificial light and controlled temperature. The five Lab sequences display two people carrying bags with weapons that are concealed from the visible wavelengths only. Moreover, there are scenes where objects are hidden behind newspapers, as well as scenes where objects concealed in bags are left unattended.

Table  Overview of the sequences from the VLIRVDIF database [].

- Camouflage: outdoor scenes; people hiding behind vegetation and/or smoke screen; two dominant scene planes at distances of approx. 0m and 00m. Environmental conditions: bright sunlight, °C.
- Lab: indoor scenes; people walking around, hiding weapon-like items within bags and behind newspapers; distance to scene plane approx. m. Environmental conditions: artificial light, °C.
- Patio: outdoor scenes; several people passing by a corridor; one person hiding behind vegetation; varying distance to scene plane (m-0m). Environmental conditions: twilight, °C.
- Trees: outdoor scenes; persons crossing a lawn and hiding in a shaded area; crossing car; one person concealed under the shade of trees; distance to scene plane approx. 0m. Environmental conditions: bright sunlight, °C.
- Hangar: indoor scenes; several people crossing a dimly lit corridor; distance to scene plane approx. 0m. Environmental conditions: artificial light, °C.
- Guanabara Bay: outdoor scenes; view of Guanabara Bay and the Rio de Janeiro-Niterói bridge; vehicles crossing the bridge; ships on Guanabara Bay; distance to scene plane approx. 00m. Environmental conditions: nighttime, °C.

Figure  Employed calibration board consisting of light bulbs arranged in a matrix, in the (a) visible-light and (b) IR spectrum. The depicted images were taken from an IR/visible-light image sequence after temporal alignment.

Several people crossing a dimly lit corridor are the main subject of the Hangar sequence. In one of the sequences a person carries a concealed weapon in a bag, while others start a conversation. The light fades in and out and occasionally blinks.

Independent of the scene content, each IR/visible-light video pair starts off by exhibiting different poses of the calibration board shown in Fig. . These poses include translational and rotational movements of the calibration board and were chosen in such a way that both temporal and spatial alignment can be performed simultaneously using the same calibration footage (see Sections ,  and ). The employed test setup consisted of a portable tripod (Fig. ) on which an IR and a visible-light camera were rigidly mounted side-by-side. The viewing angle and the zoom

Figure  Test setup consisting of an IR (left) and a visible-light camera (right) mounted side-by-side.

of the employed cameras were manually adjusted in such a way that the observed overlap between the fields-of-view of both cameras was as large as possible. The IR video sequences were obtained by recording the analogue NTSC video output of a FLIR Prism DS camera, operating in a spectral range of .µm to µm (mid-wavelength IR). In order to convert the analogue video stream to digital video, a Pinnacle Dazzle Digital Video Creator 0 video capturing device was utilized. In accordance with the NTSC standard, the resultant video exhibits a resolution of 0 0 pixels (which differs from the native 0 pixel resolution of the employed IR camera). As for the visible-light video sequences, a Panasonic HDC-TM00 camera was employed. These videos were recorded at a resolution of 0 00 and subsequently downsampled and cropped to match the IR video resolution of 0 0 pixels. Both IR and visible-light video sequences were recorded at a rate of 0 frames per second (0 fps).

The VLIRVDIF is publicly available at []. It contains both raw, unprocessed visible-light and IR video sequences, as well as their registered and synchronized counterparts. Moreover, the scenes were recorded while varying different parameters, such as the distance of the camera pair to the scene plane, illumination, camera rotation, zoom, movement of the targets in the scene, and occlusions, among others. The idea behind the variation of these parameters is twofold: to stress the registration method and to obtain a larger diversity in the database.

Camera calibration

Although the calibration procedure employed in this work has already been published in [0], we review the necessary mathematical concepts involving camera calibration and calibration point localization to allow a proper understanding of the registration process. For this purpose, we start by describing how 3D scene points can be accurately mapped onto a 2D image plane and derive the corresponding camera model. Next, based on the single camera model we review the epipolar geometry of two views and address the question of how the knowledge of the position of an image point in one view constrains the position of the corresponding point in the other view.

In the course of this work the following notation is used: homogeneous 3D coordinates X = [X Y Z 1]^T are represented by bold, capital letters whereas homogeneous 2D coordinates x = [x y 1]^T are represented in boldface, lowercase letters. Their inhomogeneous counterparts are denoted by X = [X Y Z]^T and x = [x y]^T, respectively.

As for stereo camera calibration, we use the superscript ′ to indicate entities associated with the second view.

Single Camera Calibration

In the basic pinhole camera model an image point in 2D is represented by the homogeneous vector x and its counterpart in the 3D world coordinate system by the homogeneous vector X. The general mapping given by the pinhole camera can be expressed by []

\mu \mathbf{x} = \mathbf{K}\,[\mathbf{R}\;\mathbf{t}]\,\mathbf{X}, \qquad \text{with} \quad \mathbf{K} = \begin{bmatrix} \alpha_x & s & x_0 \\ 0 & \alpha_y & y_0 \\ 0 & 0 & 1 \end{bmatrix},

where µ is an arbitrary scale factor, R and t are the extrinsic camera parameters and K is called the intrinsic camera matrix [] or camera calibration matrix []. The parameters of the rotation matrix R and the translation vector t represent the placement of the world coordinate system with respect to the camera coordinate system, whereas K contains the internal camera parameters in terms of pixel dimensions. These are the focal lengths (α_x, α_y) and the principal point (x_0, y_0) of the camera in the x and y directions, respectively, as well as the parameter s which describes the skewness of the two image axes. In this work we focus on finite cameras corresponding to the set of homogeneous matrices P = K[R t] for which the left-hand submatrix KR is non-singular.

When using a calibration device we can assume, without loss of generality, that the calibration pattern is located on the plane Z = 0 in the world coordinate system. Thus, we can rewrite eq. () such that

\mu \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \mathbf{K}\,[\mathbf{r}_1\;\mathbf{r}_2\;\mathbf{r}_3\;\mathbf{t}] \begin{bmatrix} X \\ Y \\ 0 \\ 1 \end{bmatrix} = \mathbf{K}\,[\mathbf{r}_1\;\mathbf{r}_2\;\mathbf{t}] \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} = \mathbf{H}\,\mathbf{X},

where R is given by [r_1 r_2 r_3], H = K[r_1 r_2 t] is called a homography matrix and X = [X Y 1]^T.

Lens distortion can be incorporated using the following expression [, ]

F(\tilde{\mathbf{x}}_c, \mathcal{K}, \mathcal{P}) = \begin{bmatrix} x_c \left( k_1 r^2 + k_2 r^4 + \dots \right) + \left( 2 p_1 x_c y_c + p_2 (r^2 + 2 x_c^2) \right) \\ y_c \left( k_1 r^2 + k_2 r^4 + \dots \right) + \left( p_1 (r^2 + 2 y_c^2) + 2 p_2 x_c y_c \right) \end{bmatrix},

where \tilde{\mathbf{x}}_c = [x_c \; y_c]^T are the (non-observable) distortion-free, normalized points in the camera coordinate system before applying the camera calibration matrix K, \mathcal{K} = \{k_1, k_2, \dots\} and \mathcal{P} = \{p_1, p_2\} are the coefficients of the radial and tangential distortion, respectively, and r^2 = x_c^2 + y_c^2. The (observable) distorted, normalized points \tilde{\mathbf{x}}_d are then approximated by

\tilde{\mathbf{x}}_d = \tilde{\mathbf{x}}_c + F(\tilde{\mathbf{x}}_c, \mathcal{K}, \mathcal{P})

and the final image points are given by x = K \tilde{\mathbf{x}}_d. In this work a 2nd-order radial distortion model with tangential distortion is used, such that \mathcal{K} = \{k_1\} and \mathcal{P} = \{p_1, p_2\}.
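To make the forward model above concrete, the following minimal NumPy sketch projects planar calibration points through a pinhole camera and then applies the 2nd-order radial plus tangential distortion before mapping to pixel coordinates with K. All numerical values (intrinsics, pose, distortion coefficients, board points) are invented for illustration and do not correspond to the cameras used in this work.

```python
import numpy as np

def project_points(X_w, K, R, t, k1=0.0, p1=0.0, p2=0.0):
    """Pinhole projection followed by 2nd-order radial and tangential distortion."""
    X_c = (R @ X_w.T + t.reshape(3, 1)).T             # world -> camera coordinates (N x 3)
    x_c = X_c[:, 0] / X_c[:, 2]                       # normalized, distortion-free points
    y_c = X_c[:, 1] / X_c[:, 2]
    r2 = x_c**2 + y_c**2
    # distorted, normalized points: x_d = x_c + F(x_c)
    x_d = x_c * (1 + k1 * r2) + 2 * p1 * x_c * y_c + p2 * (r2 + 2 * x_c**2)
    y_d = y_c * (1 + k1 * r2) + p1 * (r2 + 2 * y_c**2) + 2 * p2 * x_c * y_c
    # final image points: x = K x_d
    uv = (K @ np.stack([x_d, y_d, np.ones_like(x_d)]))[:2].T
    return uv                                         # N x 2 pixel coordinates

if __name__ == "__main__":
    K = np.array([[800.0, 0.0, 320.0],                # hypothetical intrinsics
                  [0.0, 800.0, 240.0],
                  [0.0, 0.0, 1.0]])
    R, t = np.eye(3), np.array([0.0, 0.0, 5.0])       # board 5 units in front of the camera
    X_w = np.array([[0.0, 0.0, 0.0],                  # calibration points on the plane Z = 0
                    [0.1, 0.0, 0.0],
                    [0.0, 0.1, 0.0]])
    print(project_points(X_w, K, R, t, k1=-0.2, p1=1e-3, p2=1e-3))
```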

With all this in mind, a final global optimization step is incorporated which estimates the complete set of parameters using the previously obtained calibration parameters as an initial guess. This optimization is done iteratively by minimizing the following functional []

\sum_i \sum_j \left\| \mathbf{x}_{ij} - \hat{\mathbf{x}}\left( \mathbf{K}, \mathcal{K}, \mathcal{P}, \mathbf{R}_i, \mathbf{t}_i, \mathbf{X}_j \right) \right\|^2,

where x_{ij} is the sub-pixel position of the j-th calibration point in the i-th calibration image, and \hat{\mathbf{x}}(\mathbf{K}, \mathcal{K}, \mathcal{P}, \mathbf{R}_i, \mathbf{t}_i, \mathbf{X}_j) is the projection of the corresponding calibration point X_j from the 3D world coordinate system. Given the calibration point positions in the real-world and camera coordinate systems, various off-the-shelf solutions for camera calibration exist. Among them, the OpenCV Camera Calibration Toolbox [] as well as the Camera Calibration Toolbox for Matlab [] are predominantly used.

Stereo Camera Calibration

In this subsection we formally define the epipolar geometry between a pair of images. As before, we start with the basic pinhole camera model which does not assume lens distortion. Suppose a 3D scene point X is imaged at the point x in the first view and at x′ in the second view. Then, corresponding points x ↔ x′ satisfy the epipolar constraint []

\mathbf{x}'^{\,T} \mathbf{F} \mathbf{x} = 0,

where F is called the fundamental matrix of the camera pair. An important property of the fundamental matrix is that it is of rank 2. Hence, F does not provide point-to-point correspondences. Instead it specifies a map x ↦ l′ from a point in one image to its corresponding epipolar line in the other image []. Assuming that both cameras, represented by the matrices P and P′, have been calibrated according to the pinhole camera model such that

\mathbf{P} = \mathbf{K}\,[\mathbf{I}\;\mathbf{0}], \qquad \mathbf{P}' = \mathbf{K}'\,[\mathbf{R}\;\mathbf{t}],

where, without loss of generality, we choose the world origin to coincide with the first camera P, then the fundamental matrix can be expressed by []

\mathbf{F} = [\mathbf{K}'\mathbf{t}]_{\times}\, \mathbf{K}' \mathbf{R} \mathbf{K}^{-1},

where [\mathbf{K}'\mathbf{t}]_{\times} denotes the skew-symmetric matrix built from the 3-vector K′t, defined such that the vector product satisfies a × b = [a]_× b, and R and t describe the relative rotation and displacement of the two cameras, respectively.

Due to the linearity of eq. (), the fundamental matrix provides a simple and computationally friendly solution to compute point-to-line correspondences within a stereo camera setup. However, for real cameras employing optical lenses such a linear mapping is no longer valid.
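As a small numerical illustration of the relations above, the sketch below assembles F = [K′t]×K′RK⁻¹ for a hypothetical calibrated camera pair and verifies the epipolar constraint on a synthetic scene point. The intrinsics, rotation and baseline are invented for the example; they are not the parameters estimated in this work.

```python
import numpy as np

def skew(a):
    """Skew-symmetric matrix [a]_x such that a x b = [a]_x b."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def fundamental_matrix(K1, K2, R, t):
    """F = [K' t]_x K' R K^{-1} for P = K[I | 0] and P' = K'[R | t]."""
    return skew(K2 @ t) @ K2 @ R @ np.linalg.inv(K1)

if __name__ == "__main__":
    # Hypothetical IR (view 1) and visible-light (view 2) intrinsics
    K_ir = np.array([[520.0, 0.0, 160.0], [0.0, 520.0, 120.0], [0.0, 0.0, 1.0]])
    K_vis = np.array([[780.0, 0.0, 320.0], [0.0, 780.0, 240.0], [0.0, 0.0, 1.0]])
    R = np.eye(3)                              # cameras mounted (nearly) parallel
    t = np.array([0.3, 0.0, 0.0])              # ~30 cm horizontal baseline

    F = fundamental_matrix(K_ir, K_vis, R, t)

    X = np.array([0.5, -0.2, 4.0])             # synthetic 3D point in the first camera frame
    x1 = K_ir @ X; x1 /= x1[2]                 # projection in view 1
    x2 = K_vis @ (R @ X + t); x2 /= x2[2]      # projection in view 2
    print("epipolar constraint x2^T F x1 =", float(x2 @ F @ x1))   # ~0 up to numerical precision
```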

To this end, the mapping of image points from the first view to the second view in the presence of lens distortion can be summarized as follows. First, apply the inverse camera calibration matrix to the 2D image points in the first view, \tilde{\mathbf{x}}_d = \mathbf{K}^{-1} \mathbf{x}. Next, in order to obtain the distortion-free, normalized points \tilde{\mathbf{x}}_c, the inverse distortion model of eq. () needs to be applied to \tilde{\mathbf{x}}_d. However, this is not straightforward since no analytic solution for the inverse exists. One way to bypass this problem is to approximate the inverse distortion model recursively [, ]:

\tilde{\mathbf{x}}_c \approx \tilde{\mathbf{x}}_d - F(\tilde{\mathbf{x}}_d, \mathcal{K}, \mathcal{P})
\tilde{\mathbf{x}}_c \approx \tilde{\mathbf{x}}_d - F\!\left(\tilde{\mathbf{x}}_d - F(\tilde{\mathbf{x}}_d, \mathcal{K}, \mathcal{P}), \mathcal{K}, \mathcal{P}\right)
\tilde{\mathbf{x}}_c \approx \tilde{\mathbf{x}}_d - F\!\left(\tilde{\mathbf{x}}_d - F\!\left(\tilde{\mathbf{x}}_d - F(\tilde{\mathbf{x}}_d, \mathcal{K}, \mathcal{P}), \mathcal{K}, \mathcal{P}\right), \mathcal{K}, \mathcal{P}\right)
\vdots

where F is defined in eq. (). By doing so, the error introduced when substituting \tilde{\mathbf{x}}_d with \tilde{\mathbf{x}}_c on the right-hand side gets smaller at each iteration. As was shown in [, ], three to four iterations are sufficient to compensate strong lens distortions. Next, the undistorted points \tilde{\mathbf{x}}_c are mapped from the first camera coordinate system through the plane at infinity [] to the camera coordinate system of the second camera [] (\tilde{\mathbf{x}}'_c = \mathbf{R}\tilde{\mathbf{x}}_c) and lens distortion is added using the forward lens distortion model of eq. () such that \tilde{\mathbf{x}}'_d = \tilde{\mathbf{x}}'_c + F(\tilde{\mathbf{x}}'_c, \mathcal{K}', \mathcal{P}'). Finally, by applying the camera calibration matrix (\mathbf{x}' = \mathbf{K}' \tilde{\mathbf{x}}'_d), a potential match of x in the second view is found. Please note that, as a consequence of lens distortion, the previously established point-to-line correspondences no longer hold. Instead, if points x and x′ correspond, then x′ lies on a curved epipolar line controlled by the polynomial distortion function of eq. ().

Besides the chosen camera model, the overall accuracy of camera calibration depends to a great extent on the ability to localize a set of calibration points within the provided calibration footage. In the next section we introduce the calibration point detection scheme for IR/visible-light imagery from [0]. It is capable of localizing corresponding calibration point pairs in the IR and visible-light spectra with high accuracy.

Calibration Point Detection

Due to the different spectral sensitivities of IR and visible-light cameras, the construction of a calibration board whose interest points appear both in the visible-light and IR spectra is not a trivial task. For example, existing camera calibration approaches based on black/white calibration patterns cannot be employed straightforwardly since, in most cases, such calibration devices do not appear in the IR image. The calibration board chosen in this work uses miniature light bulbs, equidistantly mounted on a planar board [, 0]. This configuration is of special interest since, when energized, heat and light are simultaneously emitted by the light bulbs, causing the calibration pattern to appear in both the visible-light and IR modalities. This is demonstrated in Fig. , where the employed calibration board, consisting of light bulbs arranged in a matrix, is shown in the visible-light and IR spectra, respectively. Other approaches are presented in [], [] and [].

The main advantages of the chosen calibration board include its versatility (e.g. the calibration board can be used for daytime and nighttime recordings), its fast operational readiness (plug & play) and its portability. Moreover, since the same physical entities (light bulbs) are used as calibration points in the IR and visible-light images, occasional imperfections of the calibration board (e.g. a loose contact in one of the light bulbs) can be compensated more easily. Nevertheless, when observing Fig.  some challenges associated with the chosen calibration board can be identified. Due to the use of cheap, off-the-shelf light bulbs, the emitted radiation pattern tends to differ from light bulb to light bulb, a problem which is further aggravated when tilting the calibration board. In extreme cases, this may even lead to the fading of some light bulbs.
In addition, the visibility of the light bulbs in the visible-light image depends to a large extent on the surrounding lighting conditions. For example, for outdoor sequences recorded in bright daylight the calibration points are less noticeable than for indoor scenes, where the lighting conditions can be controlled.

In order to cope with these challenges, a series of steps is proposed in [0] to robustly extract the sub-pixel positions of the miniature light bulbs along all video frames exhibiting the calibration board of Fig. .

Calibration point localization

In order to compute the exact sub-pixel positions of the miniature light bulbs along all IR/visible-light video frames exhibiting the calibration board of Fig. , we first have to separate the light bulb regions from the background. Ideally, this would be accomplished by applying a static threshold to the calibration images, labeling all pixels above the threshold as belonging to a potential light bulb region. However, due to the varying appearance of the light bulbs, no global threshold is capable of reliably producing a binary image that contains all light bulbs whilst suppressing wrongly extracted background regions. Thus, the approach adopted in this work does not rely on a single global threshold but tries to extract the exact light bulb positions by iteratively determining the optimal threshold for each calibration image.

For this purpose, we first choose an initial threshold (either manually or by means of some adaptive thresholding scheme such as the one in []) which is subsequently used to binarize the calibration image. After the thresholding operation, the extracted light bulb regions are expected to exhibit ellipse-like patterns in the binarized image. Based on this assumption, we post-process the binary image by removing all regions which appear with arbitrary shape and do not resemble the expected ellipsoidal radiation pattern. This is accomplished by fitting an ellipse to the boundary pixels of each region and discarding those for which the committed error (defined as the sum of squares of the distances between the boundary pixels of the region and the fitted ellipse) is above some threshold. Furthermore, we also remove those regions corresponding to ellipses with large eccentricity (a measure of how much the ellipse deviates from being circular), since it is assumed that the ellipses corresponding to light bulb regions closely resemble a circle. In our implementation the ellipse fitting is performed by employing the algorithm of []. The described procedure mitigates the problems of fast movements, hand shaking and occasional distortions of the light bulb regions. A first estimate of the calibration point positions is obtained by substituting the original light bulb regions with the area of the computed ellipses and calculating their centroids within the original calibration images.

If the number of computed calibration points is below the overall number of light bulbs, we repeat the above procedure using the next lower threshold. If, on the other hand, the number of extracted calibration points is larger than the number of light bulbs, a potential solution is to randomly choose a subset of calibration points from the complete set and to compute the corresponding homography using the DLT algorithm. If the correct set is chosen, mapping the calibration points from the 3D world coordinate system to the calibration image results in a small MRE, defined as

\mathrm{MRE} = \frac{1}{N} \sum_{i} \left\| \mathbf{x}_i - \mathbf{H}\mathbf{X}_i \right\|.
Here, N is the total number of light bulbs, x_i is the estimated position of the i-th calibration point within the calibration image and X_i represents the position of the corresponding calibration point in the world coordinate system. If, on the contrary, the MRE is high, we have

strong evidence that the chosen subset does not correspond to the true light bulb positions and another subset needs to be chosen. Even though this procedure was found to be robust, it is computationally expensive when the number of extracted regions is much larger than the actual number of light bulb regions. In fact, in a scenario with k light bulbs and n extracted regions with n > k, the combinatorial complexity of this approach corresponds to \binom{n}{k} = \frac{n!}{k!\,(n-k)!} different combinations. It is easy to verify that the number of possible combinations grows exponentially with the number of extracted regions. For instance, for the case of  light bulbs and ,  and  extracted regions, respectively, the overall number of combinations is , 0 and . Thus, in situations where the ratio of extracted calibration points to light bulbs renders the above-mentioned method impracticable, a preliminary step for outlier removal is needed. This is done by exploiting the available information about the light bulb distribution on the calibration board. In more detail, assuming that the distances between pairs of adjacent light bulbs are approximately constant within the calibration images, we iteratively eliminate the calibration points whose mean distances to their closest neighbors differ most from the median of distances, calculated over the whole set of extracted regions. This procedure is repeated until the combinatorial complexity of the aforementioned method is reduced to an acceptable degree, such that it can be used to remove all remaining wrongly extracted calibration points without a high computational overhead.

If the number of extracted calibration points matches the number of light bulbs, and the corresponding MRE is below a pre-defined threshold, then the final calibration point positions can be computed. Since our goal is to obtain calibration points for which the MRE is as small as possible, we refine the corresponding homography H by minimizing the functional

\min_{\mathbf{H}} \sum_{i} \left\| \mathbf{x}_i - \mathbf{H}\mathbf{X}_i \right\|^2.

The final calibration point positions are computed by applying the refined homography to the calibration point positions in the world coordinate system. Figs. a and b show the resulting calibration point positions for the visible-light and IR calibration images of Figs. a and b, respectively.

After the exclusion of the outliers, the first estimates of the calibration points are the centers of gravity of the selected regions. However, due to reflections, plate inclination and binarization threshold adjustment, the centers of gravity do not always represent the best calibration points. The minimization can be described by the following steps: (i) the calibration points are defined as the centers of gravity; (ii) the homography is calculated using the DLT algorithm [] (note that the world coordinate system points are known by construction of the board); (iii) the homography is applied to the real-world points and the positions of the calibration points are updated; (iv) steps (i)-(iii) are repeated until the MRE is acceptable or does not change (a minimal sketch is given at the end of this subsection).

The next section shows that, by means of the extracted calibration point positions, the time-shift between two unsynchronized IR and visible-light sequences can be successfully determined. A preliminary version of this temporal alignment was published in [].
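A minimal sketch of the homography fit and MRE computation behind steps (i)-(iii) is given below, using OpenCV's findHomography with method 0 (a plain least-squares/DLT fit over all points). The light-bulb grid geometry, the simulated centroid detections and the noise level are invented for illustration, and the alternation with re-detection of the centroids (step (iv)) is omitted, so this shows a single refinement pass rather than the full procedure.

```python
import numpy as np
import cv2

def snap_to_homography(world_pts, detected_pts):
    """Fit H between the board plane and the image, report the mean reprojection
    error (MRE) and snap the calibration points onto the projected grid H X_i."""
    H, _ = cv2.findHomography(world_pts, detected_pts, method=0)   # least-squares DLT
    proj = cv2.perspectiveTransform(world_pts.reshape(-1, 1, 2), H).reshape(-1, 2)
    mre = float(np.mean(np.linalg.norm(detected_pts - proj, axis=1)))
    return proj, H, mre

if __name__ == "__main__":
    # Hypothetical 4 x 5 light-bulb grid with 5 cm spacing (illustrative only)
    gx, gy = np.meshgrid(np.arange(5) * 0.05, np.arange(4) * 0.05)
    world = np.stack([gx.ravel(), gy.ravel()], axis=1).astype(np.float64)
    # Simulated centroid detections: a projective warp of the grid plus localization noise
    H_true = np.array([[900.0, 20.0, 100.0], [-15.0, 950.0, 80.0], [0.05, 0.02, 1.0]])
    detected = cv2.perspectiveTransform(world.reshape(-1, 1, 2), H_true).reshape(-1, 2)
    detected += np.random.default_rng(0).normal(scale=0.3, size=detected.shape)
    refined, H, mre = snap_to_homography(world, detected)
    print("MRE of the fitted homography (pixels):", mre)
```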

Figure  Results of the calibration point detection for the (a) visible-light and (b) IR calibration images of Fig.  (zoomed version).

Figure  Global movement of all calibration points along a (a) visible-light and (b) IR video sequence. Each line represents the vertical movement of a single calibration point. Bright pixel values indicate an upward movement whereas dark pixel values represent a downward movement of the calibration board.

Temporal Alignment

Let S_V and S_I be two video sequences N_V and N_I frames long, recorded at the same frame rate by a visible-light and an IR camera, respectively, exhibiting different poses of the calibration board of Fig. . Finding the temporal offset \hat{t} between the two video sequences S_V and S_I is equivalent to maximizing a similarity measure s(\cdot) over a set of potential temporal offset candidates t, such that

\hat{t} = \arg\max_{t} \; s(S_V, S_I, t).

The temporal alignment approach proposed in [] starts off by recording alternating translational movements of the calibration board in the downward and upward directions. This is followed by the extraction of the calibration point positions in each frame of the IR and visible-light video sequences, as elaborated in Section . Based on the extracted calibration point positions, we determine the vertical component of the speed of each calibration point along the video sequences. This is accomplished by subtracting the y-coordinates of the calibration point positions between two successive video frames. Fig.  depicts the global movement of all calibration points, with each line representing the overall vertical movement of a single calibration point. In both images, brighter pixel values indicate the displacement of the calibration board in the upward direction whereas darker pixel values suggest a downward movement of the calibration pattern.

Based on Fig. , the temporal offset between the two video sequences can be determined in a straightforward manner. It simply corresponds to the horizontal displacement between the two images for which their horizontal cross-correlation is maximized. More specifically, given a temporal offset candidate t, the similarity between the visible-light sequence S_V and the IR sequence S_I is given by

s(S_V, S_I, t) = \frac{\displaystyle\sum_{m=1}^{M} \sum_{n \in N} M_V(m, n-t)\, M_I(m, n)}{\sqrt{\left( \displaystyle\sum_{m=1}^{M} \sum_{n \in N} M_V(m, n-t)^2 \right) \left( \displaystyle\sum_{k=1}^{M} \sum_{l \in N} M_I(k, l)^2 \right)}},

where the matrices M_V(m, n) and M_I(m, n) respectively express the displacement of the m-th calibration point between two consecutive visible-light and IR frames at time instant n, M is the number of calibration points, N_V and N_I are, respectively, the number of visible-light and infrared frames considered, and N = \{ n \mid 1 \le (n - t) \le N_V \;\wedge\; 1 \le n \le N_I \}. Please note that the similarity measure of eq. () is restricted to the interval [-1, 1]. The two video sequences are considered to have coincident movements if the similarity measure is 1 and opposite movements if the result is -1. A result of 0 implies that no similarities between the two sequences could be found. As expressed in eq. (), the best estimate of the temporal offset \hat{t} between the IR and visible-light video sequences is the one for which eq. () is maximized.

Fig.  shows the result of the temporal alignment for two IR/visible-light video sequences corresponding to Fig. . The highest similarity (according to eq. ()) is obtained for a temporal offset t of  frames. This result corresponds well with Fig. , which, when subjectively assessed, suggests a time-shift of approximately 00 frames between the two sequences.

Figure  Result of the temporal alignment for the two IR and visible-light video sequences corresponding to Fig. . The highest similarity (according to eq. ()) between the two video sequences is obtained for t =  frames.
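A compact sketch of this temporal-offset search is given below: given the displacement matrices M_V and M_I (one row per calibration point, one column per frame transition), it evaluates the normalized cross-correlation for each candidate offset and keeps the maximizer. The synthetic board motion and the offset of 140 frames are made up for the example; real displacement matrices would come from the calibration point tracking described above.

```python
import numpy as np

def similarity(M_V, M_I, t):
    """Normalized cross-correlation between the displacement matrices
    M_V (M x N_V) and M_I (M x N_I) for a temporal offset candidate t."""
    N_V, N_I = M_V.shape[1], M_I.shape[1]
    n = np.arange(N_I)
    valid = (n - t >= 0) & (n - t < N_V)       # IR frames n whose counterpart n - t exists
    if not np.any(valid):
        return 0.0
    v = M_V[:, n[valid] - t]
    i = M_I[:, n[valid]]
    denom = np.sqrt(np.sum(v**2) * np.sum(i**2))
    return float(np.sum(v * i) / denom) if denom > 0 else 0.0

def estimate_offset(M_V, M_I, max_offset=300):
    """t_hat = argmax_t s(S_V, S_I, t) over a range of candidate offsets."""
    candidates = range(-max_offset, max_offset + 1)
    scores = [similarity(M_V, M_I, t) for t in candidates]
    return candidates[int(np.argmax(scores))], max(scores)

if __name__ == "__main__":
    # Synthetic example: 20 calibration points, alternating up/down board motion,
    # IR sequence lagging the visible one by 140 frames (values are illustrative).
    rng = np.random.default_rng(0)
    motion = np.sin(np.arange(600) / 15.0)                 # shared vertical motion pattern
    M_V = np.tile(motion[:500], (20, 1)) + 0.05 * rng.standard_normal((20, 500))
    M_I = np.tile(np.r_[np.zeros(140), motion[:460]], (20, 1)) \
          + 0.05 * rng.standard_normal((20, 600))
    print(estimate_offset(M_V, M_I))                       # expected offset ~140
```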

Image Registration

Once the IR/visible-light video sequence pair is synchronized, the individual and joint camera parameters of the IR/visible-light camera pair can be estimated. This is accomplished by choosing N temporally aligned calibration images and following the calibration procedure outlined in Section . Please note that in the current implementation the calibration images were chosen manually, such that a large variety of different poses of the calibration board is incorporated into the calibration process. However, this process can be automated by extracting the pose information directly from the homography matrices [, ].

A potential drawback of the proposed approach is that the calibration point localization, as described in Section , is performed using non-fronto-parallel calibration images which suffer from nonlinear distortions due to the camera optics. In order to improve calibration results, it is therefore beneficial to first map the calibration images onto an undistorted fronto-parallel view (see Fig. ) and determine the exact calibration point positions within these canonical images. However, in order to do so, full knowledge of the calibration parameters would be necessary, information that is usually not available at this point. One possible solution to this problem is presented in [], where the authors advocate an iterative refinement approach, using alternating mappings of the calibration images onto a canonical fronto-parallel view and back.

In this work we follow a similar approach. After computing a first preliminary version of the calibration parameters we remove the radial and tangential distortion from the calibration images and map them onto a canonical fronto-parallel plane in the world coordinate system. Within this fronto-parallel view we then localize the calibration points using the processing chain of Section . Finally, these new calibration points are remapped onto the original image plane and the camera parameters are recomputed using the updated calibration point positions. This process is repeated until convergence, where in each new iteration the mapping onto the fronto-parallel plane is performed using the camera parameters from the previous iteration. Fig.  shows the undistorted equivalents of Fig.  in the fronto-parallel view. As shown in Section , the calibration parameters obtained by means of this iterative calibration point refinement result in reprojection accuracies exceeding those of traditional IR/visible-light camera calibration approaches [,, ], as well as the novel model proposed in [].

Figure  Undistorted views of the calibration boards of Fig.  in the fronto-parallel plane. (a) Visible-light image. (b) IR image.

After completing the individual calibration procedures for the IR and visible-light cameras we jointly calibrate them as described in Section . By doing so, we gain knowledge of the relative displacement of the two cameras, consequently enabling us to map points from one view to the other. As previously pointed out, due to lens distortion this mapping is not linear in the sense that a point in one view does not correspond to a line in the other view. Instead, a curved line is generated on which the corresponding points in the second view reside. This is demonstrated in Fig. , where the epipolar curves resulting from mapping the IR calibration points of Fig. b to the visible-light calibration image of Fig. a are highlighted. It can be observed that the distances between the epipolar curves and the corresponding calibration points are very small, indicating the high accuracy of the stereo calibration results.
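The epipolar curves of Fig.  can be reproduced with a point-transfer sketch like the one below: a pixel from the first view is undistorted (cv2.undistortPoints performs the iterative inverse of the distortion model), back-projected to a chosen depth, moved into the second camera's frame, and re-projected with the second camera's distortion and intrinsics; sweeping the depth traces the curved epipolar line, with very large depths approaching the plane-at-infinity transfer described earlier. All camera parameters below are placeholders invented for the example, not the values estimated in this work.

```python
import numpy as np
import cv2

def transfer_point(x_pix, K1, dist1, K2, dist2, R, t, depth):
    """Map a pixel from view 1 to its potential match in view 2 at a given depth."""
    # 1) Undistorted, normalized coordinates in camera 1 (iterative inverse distortion)
    xc = cv2.undistortPoints(np.array([[x_pix]], np.float64), K1, dist1).reshape(2)
    # 2) Back-project to the chosen depth and move into camera 2's frame
    X1 = depth * np.array([xc[0], xc[1], 1.0])
    X2 = R @ X1 + t
    # 3) Re-apply camera 2's distortion and intrinsics (forward model)
    img2, _ = cv2.projectPoints(X2.reshape(1, 1, 3), np.zeros(3), np.zeros(3), K2, dist2)
    return img2.reshape(2)

if __name__ == "__main__":
    # Hypothetical IR/visible intrinsics, distortion and relative pose (illustration only)
    K_ir = np.array([[520.0, 0.0, 160.0], [0.0, 520.0, 120.0], [0.0, 0.0, 1.0]])
    K_vis = np.array([[780.0, 0.0, 320.0], [0.0, 780.0, 240.0], [0.0, 0.0, 1.0]])
    d_ir = np.array([-0.25, 0.0, 1e-3, 1e-3, 0.0])    # k1, k2, p1, p2, k3
    d_vis = np.array([-0.10, 0.0, 5e-4, 5e-4, 0.0])
    R = cv2.Rodrigues(np.array([0.0, 0.02, 0.0]))[0]  # slight relative yaw
    t = np.array([0.30, 0.0, 0.0])                    # ~30 cm baseline

    # Sweeping the depth traces the (curved) epipolar line of one IR pixel
    x_ir = np.array([200.0, 140.0])
    curve = [transfer_point(x_ir, K_ir, d_ir, K_vis, d_vis, R, t, z)
             for z in np.linspace(2.0, 200.0, 10)]
    print(np.round(np.array(curve), 1))
```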

Figure  Result of stereo calibration when mapping the IR calibration points of Fig. b to the visible-light calibration image of Fig. a. Note that due to lens distortion this mapping is not linear, resulting in curved epipolar lines.

Next, based on the obtained epipolar geometry we rectify the IR/visible-light image pairs [, ], resulting in image correspondences where the epipolar curves are linearized and run parallel to the x-axis. By doing so, disparities between the IR and the visible-light images occur in the x-direction only. In this work rectification is achieved by undistorting both image sequences using eq. () and applying two rectifying homographies H_R and H′_R to the undistorted IR and visible-light images, respectively. Thus, after rectification, point correspondences are given by []

\mathbf{x}'^{\,T}\, \mathbf{H}_R'^{\,T}\, \bar{\mathbf{F}}\, \mathbf{H}_R\, \mathbf{x} = 0, \qquad \text{where} \quad \bar{\mathbf{F}} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{bmatrix}

and x and x′ represent two corresponding image points taken from an undistorted IR/visible-light image pair. As a consequence, the epipoles e and e′, corresponding to the right and left null spaces of the fundamental matrix F, are mapped to the point p = [1 0 0]^T at infinity. Since all epipolar lines must pass through their corresponding epipoles it is easy to verify that all epipolar lines run parallel to the x-axis and, in effect, all corresponding image points have identical y-coordinates.

Fig.  shows the result of rectification for an arbitrary IR/visible-light image pair from the sequence Trees. Notice that, due to the different fields-of-view of the employed IR/visible-light camera pair, after rectification the visible-light image is completely contained within the corresponding IR image. Moreover, Fig.  also illustrates the effect of distortion removal. This is particularly apparent when observing the boundaries of the IR image which, after distortion removal, appear curved.

Upon completion of the rectification process, we manually displace the rectified images horizontally until the principal scene planes in the two views appear spatially aligned. Then, we crop the overlapping areas and resample the resulting image portions such that the final image resolution matches the native spatial resolution of the IR/visible-light video pair. Fig.  presents the registration result for the Trees sequence.

Note that this displacement process could be made automatic by identifying a region of interest in the images and corresponding points within it. Such a region of interest would correspond to a given scene depth. However, such a method would still have the limitation of not being able to perfectly register pairs having objects at very different depths.
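For readers who want to reproduce a comparable rectification with off-the-shelf tools, the sketch below uses OpenCV's stereoRectify and initUndistortRectifyMap, followed by a simple horizontal shift standing in for the manual plane alignment, crop and resampling steps described above. The calibration values and file names are placeholders, so this is an approximation of the workflow rather than the implementation used to build the VLIRVDIF.

```python
import numpy as np
import cv2

def rectify_pair(img_ir, img_vis, K_ir, d_ir, K_vis, d_vis, R, t, shift_px=0):
    """Undistort and rectify an IR/visible-light frame pair, then apply a
    horizontal shift so that the principal scene plane is aligned."""
    size = (img_ir.shape[1], img_ir.shape[0])        # both frames share the IR resolution
    R1, R2, P1, P2, _, _, _ = cv2.stereoRectify(K_ir, d_ir, K_vis, d_vis, size, R, t)
    map_ir = cv2.initUndistortRectifyMap(K_ir, d_ir, R1, P1, size, cv2.CV_32FC1)
    map_vis = cv2.initUndistortRectifyMap(K_vis, d_vis, R2, P2, size, cv2.CV_32FC1)
    rect_ir = cv2.remap(img_ir, *map_ir, cv2.INTER_LINEAR)
    rect_vis = cv2.remap(img_vis, *map_vis, cv2.INTER_LINEAR)
    # Manual horizontal displacement aligning the dominant scene plane (disparity is x-only)
    rect_vis = np.roll(rect_vis, shift_px, axis=1)
    return rect_ir, rect_vis

if __name__ == "__main__":
    # Placeholder calibration values and file names (illustrative only)
    K_ir = np.array([[520.0, 0.0, 160.0], [0.0, 520.0, 120.0], [0.0, 0.0, 1.0]])
    K_vis = np.array([[780.0, 0.0, 320.0], [0.0, 780.0, 240.0], [0.0, 0.0, 1.0]])
    d_ir = np.array([-0.25, 0.0, 1e-3, 1e-3, 0.0])
    d_vis = np.array([-0.10, 0.0, 5e-4, 5e-4, 0.0])
    R, t = np.eye(3), np.array([0.30, 0.0, 0.0])
    ir = cv2.imread("frame_ir.png", cv2.IMREAD_GRAYSCALE)
    vis = cv2.imread("frame_vis.png", cv2.IMREAD_GRAYSCALE)
    rect_ir, rect_vis = rectify_pair(ir, vis, K_ir, d_ir, K_vis, d_vis, R, t, shift_px=12)
    # Pseudo-color overlay as in the figures: visible-light in red, IR in green (BGR order)
    overlay = cv2.merge([np.zeros_like(rect_ir), rect_ir, rect_vis])
    cv2.imwrite("overlay.png", overlay)
```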

Figure  Result of image rectification for a sample IR/visible-light image pair from the sequence Trees. For visualization purposes, the two images were overlaid on top of each other and occupy the red (visible-light) and green (IR) channels in the depicted RGB pseudo-color image.

Figure  Final registration results for an IR/visible-light image pair from the Trees image sequence. (a) Registered visible-light image. (b) Registered IR image. (c) RGB pseudo-color image where the registered visible-light and IR images of (a) and (b) occupy the red and green color channels, respectively.

A similar problem is faced in [], a work that presents an almost automatic video registration method that relies on correspondences found via shape contour matching. Although out of the scope of our work, an option would be, given a set of corresponding salient points in the two images, to compute their depths and use depth-based image rendering [0] on one of the images to perform registration at all depths. However, such a method would still have to deal with the problem of areas occluded between the two cameras. In [] the occlusion problem is addressed using a video registration technique based on RANSAC trajectory-to-trajectory matching for far-range videos. It estimates an affine transformation matrix that maximizes the overlap of IR and visible foreground pixels. The method assumes that there is an intersection of the fields of view of the thermal and visible cameras and it does not employ any camera calibration method. A registration method for IR and

visible-light stereo videos based on local self-similarity is presented in [], which also treats the problem of occluded regions in a scene.

Results

In this section the effectiveness of the developed IR/visible-light video registration framework is demonstrated. For the sake of brevity we constrain our discussion to the registration results of six temporally and spatially misaligned IR/visible-light video sequence pairs, each originating from a distinct recording location with varying scene content and lighting conditions (see Table  for more details). The results for the remaining sequences constituting the VLIRVDIF are available at []. Representative scene thumbnails of the selected IR/visible-light video sequences (before registration) are illustrated in Fig. .

Figure  Selected IR/visible-light scene thumbnails from the video sequences used for evaluation purposes. The top row consists of visible-light images, whereas the bottom row represents the corresponding IR images.

The VLIRVDIF challenges

The IR/visible-light video pairs are naturally non-overlapping due to their different spectral bands. This poses difficulties for the calibration, registration, temporal alignment and fusion processes. To that end, the VLIRVDIF is designed to accentuate characteristics that impose different degrees of difficulty to be tackled by these processes. For example, the VLIRVDIF is composed of video pairs with several IR/visible-light frames presenting different degrees of occlusion, as well as regions within the frames that have absolutely no similarities with their associated visible-light/IR frame pairs. The VLIRVDIF also provides non-planar scenes, where objects appear on different depth planes. Some methods may need rectified video frames, which are not available when dealing with non-planar scenes. As the fundamental idea of registration is finding correspondences between video frame pairs to allow scenes and objects to be represented in a common coordinate system, these characteristics impose a significant challenge on registration algorithms. In addition, some sequences were shot under high temperature levels, which can hamper the efficiency of fusion methods.

Temporal Alignment Results

The estimated temporal offsets t for the selected video sequences (see Fig. ) together with the corresponding similarity measures of eq. () are given in Table . The attained similarity is very close to 1 for all six assessed video sequences. This implies that after

Table  Results of the temporal offset estimation for the six different IR/visible-light video sequence pairs corresponding to the scenes depicted in Fig. , from left to right.

                  1st pair   2nd pair   3rd pair   4th pair   5th pair   6th pair
Temporal Offset
Similarity

Figure  Five calibration frames from the Lab image sequence (a) before and (b) after temporal alignment.

temporal alignment the movements of the calibration board are almost identical between the IR and visible-light video sequences. However, it is worth noting that the overall similarity measure depends, to a certain extent, on the movements performed with the calibration board. Thus, a small similarity does not necessarily imply a poor estimation of the temporal offset. In addition, the curve pictured in Fig.  exhibits a single distinct peak corresponding to the position of the correct temporal offset, suggesting the robustness of the proposed approach.

In order to visually demonstrate the effectiveness of the proposed temporal alignment scheme, Fig.  shows five calibration frames from the second IR/visible-light video sequence pair of Table  before and after temporal alignment. It can be noted that the unsynchronized video frames (Fig. a) display a significant misalignment in time. This is particularly evident when observing the four IR video frames to the right, which appear to lag considerably behind the visible-light frames. As for the synchronized video frames (Fig. b), both IR and visible-light frames exhibit similar poses of the alignment board, thus indicating the correct temporal alignment of the IR/visible-light video sequence pair.
