Multi-Camera Calibration, Object Tracking and Query Generation


MITSUBISHI ELECTRIC RESEARCH LABORATORIES
http://www.merl.com

Multi-Camera Calibration, Object Tracking and Query Generation
Porikli, F.; Divakaran, A.
TR2003-100, August 2003

Abstract

An automatic object tracking and video summarization method for multi-camera systems with a large number of non-overlapping field-of-view cameras is explained. In this framework, video sequences are stored for each object as opposed to storing a sequence for each camera. Object-based representation enables annotation of video segments, and extraction of content semantics for further analysis. We also present a novel solution to the inter-camera color calibration problem. The transitive model function enables effective compensation for lighting changes and radiometric distortions for large-scale systems. After initial calibration, objects are tracked at each camera by background subtraction and mean-shift analysis. The correspondence of objects between different cameras is established by using a Bayesian Belief Network. This framework empowers the user to get a concise response to queries such as "which locations did an object visit on Monday and what did it do there?"

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved.

Copyright © Mitsubishi Electric Research Laboratories, Inc., 2003
201 Broadway, Cambridge, Massachusetts 02139


Publication History: 1. First printing, TR-2003-100, August 2003

MULTI-CAMERA CALIBRATION, OBJECT TRACKING AND QUERY GENERATION

Fatih Porikli, Ajay Divakaran
Mitsubishi Electric Research Labs

ABSTRACT

An automatic object tracking and video summarization method for multi-camera systems with a large number of non-overlapping field-of-view cameras is explained. In this framework, video sequences are stored for each object as opposed to storing a sequence for each camera. Object-based representation enables annotation of video segments, and extraction of content semantics for further analysis. We also present a novel solution to the inter-camera color calibration problem. The transitive model function enables effective compensation for lighting changes and radiometric distortions for large-scale systems. After initial calibration, objects are tracked at each camera by background subtraction and mean-shift analysis. The correspondence of objects between different cameras is established by using a Bayesian Belief Network. This framework empowers the user to get a concise response to queries such as "which locations did an object visit on Monday and what did it do there?"

1. INTRODUCTION

The single-camera single-room architecture of multi-camera surveillance applications demands automatic and accurate calibration, detection of objects of interest, tracking, fusion of multiple modalities to solve the inter-camera correspondence problem, easy access to and retrieval of video data, the capability to pose semantic queries, and effective abstraction of video content. Although several multi-camera setups have been adapted for 3D vision problems, non-overlapping camera systems have not been investigated thoroughly. In [2], a multi-camera system that uses a Bayesian network to combine multiple modalities is proposed. Among these modalities, the epipolar, homography, and landmark information assume that any pair of cameras in the system has an overlapping field of view. Due to this assumption, the approach is not applicable to the single-camera single-room architecture.

A major problem of multi-camera systems is the color calibration of the cameras. In the past few years, many algorithms have been developed to compensate for radiometric mismatches. Most approaches use registered images of a uniformly illuminated color chart of known reflectance, taken under different exposure settings, as a reference [1], and estimate the parameters of a brightness transfer function. Often, they assume the function is smooth and polynomial. However, uniform illumination conditions may not be possible outside of a controlled environment.

Fig. 1. A multi-camera setup can contain several cameras working under different lighting conditions.

In this paper, we designed a framework in which we can extract object-wise semantics from a non-overlapping field-of-view multi-camera system. This framework has four main components: camera calibration, automatic tracking, inter-camera data fusion, and query generation. We developed an object-based video content labeling method to restructure the camera-oriented videos into object-oriented results. We also propose a summarization technique that uses the motion activity characteristics of the encoded video segments to address the storage and presentation of the immense amount of video data. To solve the calibration problem, we developed a correlation matrix and dynamic programming based method. We use color histograms to determine the inter-camera radiometric mismatch. Then, a minimum cost path within the correlation matrix is found using dynamic programming. This path is projected onto the diagonal axis to obtain a model function that can transfer one histogram to the other. In addition, we use a novel distance metric to determine the object correspondences.
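Both the calibration and the correspondence stages operate on per-object color histograms. As a minimal sketch of this representation (the 16-bin per-channel quantization and the function name are illustrative assumptions, not values given in the paper):

    import numpy as np

    def color_histogram(image, n_bins=16):
        # Normalized per-channel color histogram of an RGB image region.
        # The 16-bin quantization is an illustrative choice; the paper does
        # not specify the number of bins.
        image = np.asarray(image, dtype=np.uint8)
        channel_hists = []
        for c in range(3):  # R, G, B channels
            counts, _ = np.histogram(image[..., c], bins=n_bins, range=(0, 256))
            channel_hists.append(counts)
        h = np.concatenate(channel_hists).astype(np.float64)
        return h / h.sum()  # normalize so the bins sum to one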
2. RADIOMETRIC CALIBRATION

A typical indoor surveillance system consists of several cameras with non-overlapping views, as illustrated in Fig. 1. Usually, such systems consist of identical cameras that operate under various lighting conditions, or of different cameras that have dissimilar radiometric characteristics. Even identical cameras, which have the same optical properties and are working under the same lighting conditions, may not match in their color responses. Images of the same object acquired under these variations show color dissimilarities. As a result, correspondence, recognition, and other related computer vision tasks become more challenging.

Fig. 2. After computing the correlation matrix, a minimum cost path is found and transformed to a model function. Using the model function obtained in the calibration stage, the output of one camera is compensated.

Fig. 3. Relation between the minimum cost path and f(j).

We compute pair-wise inter-camera color model functions that transfer the color histogram response of one camera to the other, as illustrated in Fig. 2, in the initial calibration stage. First, we record images of the identical objects for each camera. For the images of an object for the current camera pair 1-2, I^1_k and I^2_k, we find the color histograms h^1_k and h^2_k. A histogram h is a vector [h[0], ..., h[N]] in which each bin h[n] contains the number of pixels corresponding to that color range. Using the histograms h^1_k and h^2_k, we compute a correlation matrix M^{1,2}_k between the two histograms as the set of positive real numbers that represent the bin-wise histogram distances

    M^{1,2} = [ c_11  c_12  ...  c_1N ;  ... ;  c_N1  ...  c_NN ]        (1)

where each element c_mn is a positive real number such that c_mn = d(h^1[m], h^2[n]) and d(.) >= 0 is a distance norm. Note that the sum of the diagonal elements represents the bin-by-bin distance under the given norm d(.); for example, by choosing the L2 norm, the sum of the diagonal becomes the Euclidean distance between the histograms. An aggregated correlation matrix is calculated by averaging the corresponding matrices of K image pairs as M^{1,2} = (1/K) sum_{k=1}^{K} M^{1,2}_k.

Given two histograms and their correlation matrix, the question is: what is the best alignment of their shapes, and how can this alignment be determined? We reduce the comparison of two histograms to finding the minimum cost path in the matrix M^{1,2}. We use dynamic programming and a modified Dijkstra's algorithm to find the path p : {(m_0, n_0), ..., (m_I, n_I)} that has the minimum cost from c_11 to c_NN, i.e., the sum of the matrix elements on the path p gives the minimum score among all possible routes. We define a cost function for the path as g(p_i) = c_{m_i, n_i}, where p_i denotes the path element (m_i, n_i). We define a mapping i -> j from the path indices to the projection onto the diagonal of the matrix M, and an associated transfer function f(j) that gives the distance from the diagonal with respect to the projection j. The transfer function is defined as a mapping from the matrix indices to real numbers, (m_i, n_i) -> f(j), such that f(j) = |p_i| sin(theta), as illustrated in Fig. 3. The correlation distance is the total cost along the transfer function, d_CD(h^1, h^2) = sum_{i=0}^{I} |g(m_i, n_i)|.

Fig. 4. The camera-2 image is compensated using the model function (histograms: black, camera 1; blue, camera 2; red, compensated).
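A compact sketch of this calibration step, under stated assumptions: the L1 bin distance stands in for d(.), the minimum cost path is found with a plain dynamic program over right/down/diagonal moves (the paper uses dynamic programming with a modified Dijkstra's algorithm), and the model function is read off as the offset of the path from the diagonal. The function names are ours; this illustrates the idea rather than reproducing the authors' implementation.

    import numpy as np

    def correlation_matrix(h1, h2):
        # Bin-wise distance matrix M[m, n] = d(h1[m], h2[n]); L1 distance as an example.
        h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
        return np.abs(h1[:, None] - h2[None, :])

    def min_cost_path(M):
        # Minimum-cost monotone path from M[0, 0] to M[N-1, N-1] by dynamic programming.
        N = M.shape[0]
        D = np.full((N, N), np.inf)
        D[0, 0] = M[0, 0]
        for m in range(N):
            for n in range(N):
                if m == 0 and n == 0:
                    continue
                best = np.inf
                if m > 0:
                    best = min(best, D[m - 1, n])
                if n > 0:
                    best = min(best, D[m, n - 1])
                if m > 0 and n > 0:
                    best = min(best, D[m - 1, n - 1])
                D[m, n] = M[m, n] + best
        # Backtrack from the bottom-right corner to recover the path.
        path, m, n = [(N - 1, N - 1)], N - 1, N - 1
        while (m, n) != (0, 0):
            candidates = []
            if m > 0:
                candidates.append((m - 1, n))
            if n > 0:
                candidates.append((m, n - 1))
            if m > 0 and n > 0:
                candidates.append((m - 1, n - 1))
            m, n = min(candidates, key=lambda mn: D[mn])
            path.append((m, n))
        return path[::-1]

    def model_function(path, n_bins):
        # Project each path point onto the diagonal and record its offset from it;
        # the resulting f(j) tells how far bin j must be shifted to align the histograms.
        f = np.zeros(n_bins)
        for m, n in path:
            j = (m + n) // 2   # projection of (m, n) onto the diagonal
            f[j] = n - m       # signed offset from the diagonal at that projection
        return f

    # Example usage: f = model_function(min_cost_path(correlation_matrix(h1, h2)), len(h1))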
3. SINGLE-CAMERA TRACKING

After initial calibration, objects are detected and tracked at each camera (Fig. 5). A common approach to detecting a moving object with a stationary camera is background subtraction. The main idea is to subtract the current image from a reference image that is constructed from the static image pixels over a period of time. We previously developed an object detection and tracking method that integrates model-based background subtraction with a mean-shift based forward tracking mechanism [3]. Our method constructs a reference image using pixel-wise mixture models, finds the changed parts of the image by background subtraction, removes shadows by analyzing the color and spatial properties of pixels, determines objects, and tracks them in the consecutive frames.

Fig. 5. Single-camera tracking.

In background subtraction, we model the history of each color channel of each pixel by a mixture of Gaussian distributions. The reference image is updated by comparing the current pixel with the existing distributions: if the current pixel's color value is within a certain distance of the mean value of a distribution, that mean value is assigned as the background. After background subtraction, we detect and remove shadow pixels from the set of foreground pixels. The likelihood of being a shadow pixel is evaluated iteratively by observing the color-space attributes and local spatial properties. The next task is to find the separate objects; to accomplish this, we first remove speckle noise, then determine connected regions, and group the regions into separate objects. We track objects by computing the highest gradient direction of the color histogram, which is implemented as a maximization process. This process is iterated given the current object histogram extracted from the previous frame. The tracking results at each camera are then merged to determine the global behaviour of objects (Fig. 6).

Fig. 6. Multi-camera surveillance system.
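A minimal per-pixel sketch of the kind of mixture-model background update described above, for a single color channel; the number of modes, learning rate, and match threshold are illustrative assumptions rather than the paper's parameters.

    import numpy as np

    class PixelMixtureBackground:
        # Per-pixel background model with a small mixture of Gaussians (one channel).
        def __init__(self, n_modes=3, alpha=0.05, match_sigma=2.5):
            self.alpha = alpha                    # learning rate (assumed value)
            self.match_sigma = match_sigma        # match if within this many std devs
            self.means = np.zeros(n_modes)
            self.vars = np.full(n_modes, 225.0)   # initial variance (~15 gray levels)
            self.weights = np.full(n_modes, 1.0 / n_modes)

        def update(self, x):
            # Update the model with pixel value x; return True if x is background.
            d = np.abs(x - self.means)
            matches = d < self.match_sigma * np.sqrt(self.vars)
            if matches.any():
                k = int(np.argmin(np.where(matches, d, np.inf)))  # closest matching mode
                self.weights = (1 - self.alpha) * self.weights
                self.weights[k] += self.alpha
                self.means[k] = (1 - self.alpha) * self.means[k] + self.alpha * x
                self.vars[k] = (1 - self.alpha) * self.vars[k] + self.alpha * (x - self.means[k]) ** 2
                is_background = True
            else:
                k = int(np.argmin(self.weights))  # replace the weakest mode
                self.means[k], self.vars[k], self.weights[k] = x, 225.0, 0.05
                is_background = False
            self.weights /= self.weights.sum()
            return is_background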

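The histogram-gradient tracking step can be sketched as a standard mean-shift iteration that moves the object window to the centroid of back-projection weights. This generic single-channel form (window size, bin count, and the square-root weighting are assumptions) illustrates the maximization process rather than the authors' exact tracker; image-boundary handling is omitted for brevity.

    import numpy as np

    def region_histogram(frame, cx, cy, w, h, n_bins=16):
        # Normalized gray-level histogram of a w x h window centred at (cx, cy).
        patch = frame[cy - h // 2: cy + h // 2, cx - w // 2: cx + w // 2]
        hist, _ = np.histogram(patch, bins=n_bins, range=(0, 256))
        return hist / max(hist.sum(), 1)

    def mean_shift_track(frame, target_hist, cx, cy, w, h, n_bins=16, n_iters=10):
        # Iteratively shift the window toward higher histogram similarity by moving
        # to the weighted centroid of the back-projection weights.
        bin_width = 256 // n_bins
        for _ in range(n_iters):
            cand_hist = region_histogram(frame, cx, cy, w, h, n_bins)
            weights = np.sqrt(target_hist / np.maximum(cand_hist, 1e-8))
            ys, xs = np.mgrid[cy - h // 2: cy + h // 2, cx - w // 2: cx + w // 2]
            patch = frame[cy - h // 2: cy + h // 2, cx - w // 2: cx + w // 2]
            wmap = weights[np.clip(patch // bin_width, 0, n_bins - 1)]
            new_cx = int(np.round((wmap * xs).sum() / wmap.sum()))
            new_cy = int(np.round((wmap * ys).sum() / wmap.sum()))
            if (new_cx, new_cy) == (cx, cy):
                break
            cx, cy = new_cx, new_cy
        return cx, cy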
4. INTER-CAMERA CORRESPONDENCE

Another main issue is the integration of the tracking results of each camera to make inter-camera tracking possible. To find the corresponding objects in different cameras and in a central database of previous appearances, we evaluate the likelihood of possible object matches by fusing object features such as color, shape, texture, movement, camera layout, and ground plane. Color is the most common feature and is widely accepted by object recognition systems, since it is relatively robust to size and orientation changes. Since an adequate level of detail is usually not available in surveillance video, texture and face features do not improve recognition accuracy. Another concern is face orientation: most face-based methods work only for frontal images. The biggest drawback of shape features is their sensitivity to boundary inaccuracies. A height descriptor helps only if the ground plane is available.

Fig. 7. Each camera corresponds to a node in the directed graph. The links show physical routes between cameras. The probability of object movement is represented using the transition table.

There is a strong correlation between the camera layout and the likelihood of objects appearing in a certain camera after they exit from another one. As in Fig. 7, we formulate the camera system as a Bayesian Belief Network (BBN), which is a graphical representation of a joint probability distribution over a set of random variables. A BBN is a directed graph in which each random variable is represented by a node, and directed edges between nodes represent conditional dependencies. In the multi-camera system, each camera corresponds to a node in the directed graph, and the links show the physical routes between the cameras. The probabilities of object movement are the elements of the transition table. The transition probabilities, that is, the likelihood of a person moving from the current camera to a linked camera, are learned by observing the system. Note that each direction on a link may have a different probability, and the total incoming and outgoing probability values are each equal to one. To satisfy the second constraint, slack nodes that correspond to the unmonitored entrance/exit regions are added to the graph.
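A sketch of how such a transition table might be learned by observation; the three-camera topology, the observation list, and the counting-based estimate are toy assumptions introduced only for illustration.

    import numpy as np

    # Nodes: three monitored cameras plus one slack node for unmonitored exits/entrances.
    nodes = ["cam1", "cam2", "cam3", "slack"]
    idx = {name: i for i, name in enumerate(nodes)}

    # Observed transitions (from_camera, to_camera) collected while watching the system.
    observations = [("cam1", "cam2"), ("cam1", "cam2"), ("cam2", "cam3"),
                    ("cam2", "slack"), ("cam3", "cam1"), ("cam1", "slack")]

    counts = np.zeros((len(nodes), len(nodes)))
    for src, dst in observations:
        counts[idx[src], idx[dst]] += 1

    # Transition table: each row is normalized so outgoing probabilities sum to one.
    row_sums = counts.sum(axis=1, keepdims=True)
    transition = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

    print(transition[idx["cam1"]])   # P(next camera | object left cam1)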
Initially, no object is assigned to any node of the BBN; the numbers of objects in the cameras and of objects in the database are both zero. The database keeps track of the individual objects. Suppose an object O_1 is detected at camera C_i. For each newly detected object, a database entry is made using its color histogram features. If the object O_1 exits from camera C_i, then the conditional probability P_{O_1}(C_j | C_i) that the same object will be seen at another camera C_j is found by P_{O_1}(C_j | C_i) = P(C_j | C_{s_1}) ... P(C_{s_k} | C_i), where {s_1, ..., s_k} is the highest-probability path from C_i to C_j on the graph.

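Continuing the toy transition table from the previous sketch, the reappearance probability along the highest-probability path, together with the time decay described in the next paragraph, could be computed as follows; the brute-force path search, alpha value, and floor threshold are assumptions made for the example.

    import itertools
    import numpy as np

    def best_path_probability(transition, src, dst):
        # Probability of the highest-probability path from src to dst, by brute force
        # over simple paths (sufficient for the handful of nodes in a camera graph).
        n = transition.shape[0]
        best = 0.0
        others = [k for k in range(n) if k not in (src, dst)]
        for r in range(len(others) + 1):
            for mid in itertools.permutations(others, r):
                path = (src,) + mid + (dst,)
                p = np.prod([transition[a, b] for a, b in zip(path, path[1:])])
                best = max(best, p)
        return best

    def decayed(prob, steps, alpha=0.9, floor=0.05):
        # Erode a reappearance probability over time, but never below a threshold;
        # the paper only states that the decay parameter is less than one.
        return max(prob * alpha ** steps, floor)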
Because of the dynamic nature of the surveillance system, the conditional probabilities change with time: P_{O_1}(C_j, t | C_i) = P(C_j, t | C_{s_1}) ... P(C_{s_k}, t | C_i). The conditional probabilities are set to erode over time using a parameter alpha < 1, as P(O_1, t | C_i) = alpha P(O_1, t-1 | C_i), since the object may exit from the system completely (we do not assume that a multi-camera system is a closed graph). However, the probabilities are not allowed to drop below a threshold. As a new object is detected, it is compared with the objects in the database and with the objects that disappeared previously. The comparison is based on the color histogram similarity and the proposed distance metric. When there is more than one candidate correspondence, we select the match as the pair (O_m, O_n) that maximizes P(O_m, t | C_i) P(O_n, t | C_j). To match objects between two cameras C_i and C_j, we evaluate the matching for all objects simultaneously rather than matching them independently.

5. OBJECT BASED QUERIES

After matching the objects between the cameras, we label each video frame according to the object appearances. This enables us to include content semantics in the subsequent processes. A semantic scene is defined as a collection of shots that are consistent with respect to a certain semantic, in our case the identities of objects. Since we know the camera locations and have extracted which objects appeared in which video at what time, we can, for instance, query for what an object did in a certain time period at a certain location. To obtain a concise representation of query results, we generate an abstract of the query result. The key-frame-based summary is a collection of frames that aims to capture all of the visual essence of the video except, of course, the motion. In [4], we showed that the frame at which the cumulative motion activity is half the maximum value is also the halfway point for the cumulative increment in information. Thus, we select the key frames according to this cumulative function.
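A minimal sketch of this key-frame rule: given per-frame motion activity values, pick the frame where the cumulative activity first reaches half of its total (the activity values in the example are placeholders, not data from the paper).

    import numpy as np

    def half_activity_keyframe(motion_activity):
        # Index of the frame where cumulative motion activity reaches half its total.
        cumulative = np.cumsum(np.asarray(motion_activity, dtype=np.float64))
        return int(np.searchsorted(cumulative, 0.5 * cumulative[-1]))

    # Example with placeholder per-frame activity values:
    print(half_activity_keyframe([0.1, 0.4, 0.2, 0.9, 0.3, 0.1]))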
6. DISCUSSION

A base station controls the fusion of the multi-camera information, and its computational load is negligible. We presented sample results of the single-camera tracking module in Fig. 8-a, and the object-based inquiry interface in Fig. 8-b. The tracking method can follow several objects at the same time, even when some of the objects are totally occluded for a long time. Furthermore, it provides accurate object boundaries. We initialized the Bayesian Network with identical conditional probabilities; these probabilities may also be adapted by observing the long-term motion behaviour of the objects. Sample query results are shown in Fig. 8-c. We are able to extract all appearances of the same person in a specified time period accurately. We can also count the number of different people.

Fig. 8. (a) Inquiry system, (b) extracted instances of an object in all three cameras.

We presented a novel color calibration method. Unlike the existing approaches, our method does not require uniformly illuminated color charts or controlled-exposure image sets. Furthermore, our method can model non-linear, non-parametric distortions, and the transitive property makes the calibration of large-scale systems much simpler. The object-based representation enables us to associate content semantics, so we can generate query-based summaries. This is an important functionality for retrieving the target video segments from a large database of surveillance video.

7. REFERENCES

[1] M. Grossberg and S. K. Nayar, "What can be known about the radiometric response function using images," Proc. ECCV, 2002.
[2] T. Chang and S. Gong, "Bayesian modality fusion for tracking multiple people with a multi-camera system," Proc. AVSS, 2001.
[3] F. Porikli and O. Tuzel, "Human body tracking and movement analysis," Proc. ICVS-PETS, Austria, 2003.
[4] A. Divakaran, "Video summarization using descriptors of motion activity," Electronic Imaging, 2001.