Managing and Mining Video-Sensor Data


Edward Y. Chang, Ankur Jain, Navneet Panda, Yuan-Fang Wang, Gang Wu, and Yi Wu
Department of Electrical Engineering & Computer Science, University of California, Santa Barbara, CA

Abstract

Sensor research has been undergoing a quiet revolution, promising significant impact on a broad range of applications relating to national security, health care, the environment, and more. In this work, we focus on building large-scale video-sensor networks, addressing several large-scale data-management and data-mining issues. We outline our preliminary results in two research areas: sensor-network resource management, and statistical methods for sensor-data mining. We also discuss directions that future research might take.

1 Introduction

Video sensors (video cameras) and wireless networks are becoming ubiquitous features of modern life. The confluence of these two technologies now makes it possible to construct wireless ad hoc networks of multiple video sensors that can be rapidly deployed, dynamically configured, and continuously operated to provide highly available coverage for environment monitoring and security surveillance. While many of these extended eyes are being installed at an unprecedented pace, the intelligence needed for computers to interpret video-surveillance events is still rather unsophisticated. In addition, algorithms that scale well with the number of sensors and the high volume of data have yet to be developed for effectively managing and mining video-data streams. To develop a brain behind a large number of optical eyes to support (semi-)automatic event sensing, we have been developing statistical algorithms to improve the two major phases of a distributed, mobile surveillance system: data management and event mining [35, 122]. The data-management phase integrates multi-source data to detect and extract motion trajectories from video sources. (While motion analysis is the charter of the Computer Vision community, our work focuses on resource management and data integration.) The event-mining phase deals with classifying events by their relevance to a query. The research challenges of the two phases are summarized as follows:

Data management. Data management deals with sampling and filtering data at the sensors, transmitting the (possibly noisy or partial) data to the server, and fusing distributed data to extract motion trajectories. Data management comes up against two major research challenges: distributed sensor and sensor-network management, and spatio-temporal data fusion. Given a query and its precision requirement, the sensor network may need to move some cameras and reconfigure itself to see the event of interest. For instance, when a far-field camera detects some movement, a pan/tilt/zoom camera may be instructed to zoom onto the moving object, and at the same time more resources (network and processing bandwidth) are allocated to these cameras. This reconfiguration must be performed in such a way that the most useful data can be collected, given the resource constraints, to answer queries. Once useful data have been collected from the cameras, the next challenge is to integrate observations from multiple cameras to build spatio-temporal patterns (to perform a spatio-temporal join) that can best describe events in the environment. Such integration is necessary to improve surveillance coverage and reliability, and to deal with transient object-tracking obstacles such as spatial occlusion and scene clutter.

Event mining.
Event mining deals with mapping motion trajectories (sequence data) to semantics (e.g., benign and suspicious events). Most traditional statistical learning algorithms cannot be directly applied to variable-length sequence data, which may also exhibit temporal ordering. In addition, positive events (i.e., the sought-for hazardous events) are always significantly outnumbered by negative events in the training data. In such an imbalanced set of training data, the class boundary tends to skew toward the minority class and becomes very sensitive to noise. Furthermore, the best feature-to-event mapping is often application-, task-, and user-dependent. To provide useful results, the event recognizer must adapt its distance function to the circumstances as needed.

To answer the above challenges, we have been working on four research tasks to advance fundamental theories and develop statistical methods that can significantly improve the operation of video-sensor networks, the quality of data fusion, and the accuracy of event analysis.

1. Sensor-network resource management (Section 2). We have been developing statistical methods to manage networks for conserving resources, including power at the sensor nodes, as well as network bandwidth and other system resources at the server.

2. Statistical sensor-data fusion (Section 3). We have been devising algorithms to fuse spatially and temporally overlapped data for reliable event detection. Our research focus is on enhancing the reliability of existing object-tracking algorithms by performing both sensor-to-server data fusion and server-to-sensor information dissemination.

3. Sequence-data to event mapping (Section 4). Many abstractions (e.g., Fourier, wavelets, string descriptors) have been proposed in the past to represent sequence data. However, we believe that the best abstraction should be event-dependent. Therefore, our approach is first to extract multi-resolution descriptors from sequence data, and then to rely on the algorithms that we subsequently develop to learn the best descriptor combination for a target semantic. Furthermore, in contrast to existing approaches (e.g., the widely used Hidden Markov Models), our statistical methods require a much smaller amount of training data to model a target event. Reducing training data (or sample complexity) is critical for event detection, since training data for target events are often difficult to collect.

4. Context-based distance-function formulation (Section 5). The choice of a good distance function plays a key role in any information-retrieval task. We propose to formulate distance functions in a task- and query-dependent way. (Most traditional data-mining tasks employ a distance function such as the L1- or L2-norm universally, without considering the characteristics of queries or the preferences of users.) Our distance-function formulation method consists of two steps: 1) using active learning to acquire application- and query-dependent information, and 2) applying kernel alignment to modify the input space in an efficient, non-linear way.

The remainder of this paper presents, for each of the above four tasks, its related work, preliminary results, and future research plans.

2 Sensor-network Resource Management

In a distributed sensor network, cameras record continuous high-volume video streams. Because of the high data volume and rapid rate, it is infeasible for an untethered, battery-powered sensor node to transmit a large quantity of raw data to a server for processing [10, 53, 87]. To conserve resources (network bandwidth, storage, and CPU), many recent papers [3, 7, 30, 95, 113] propose methods to reduce the amount of data delivered to the server. In these schemes, no data communication takes place as long as the server can answer queries within the specified precision constraints. This research task investigates statistical methods for meeting a specified query precision while consuming a minimum amount of resources. A major shortcoming of the existing solutions is that they are often ad hoc, as explained in [7] by Widom and Motwani, and are highly application-dependent.
To seek a unified solution for managing distributed streams, we treat resource management in a sensor network as fundamentally a filtering problem: an effective stream-filtering algorithm should filter out the maximum amount of data while the query-precision constraints are still met at the server. We introduce our Dual Kalman Filter (DKF) architecture [66] as a general and adaptive solution to the stream-resource-management problem. Figure 1 depicts the role of the proposed DKF model in a typical sensor-network architecture. A user (on the left-hand side of the figure) issues to the server an event query with certain precision constraints. The server activates a Kalman Filter, denoted KF_s, and at the same time the target sensor activates a mirror filter with the same parameters, denoted KF_m. The dual filters KF_s and KF_m predict future data values. Only when the filter KF_m at the sensor fails to predict future data within the precision constraint (thus preventing KF_s from making an accurate prediction at the server) does the sensor send updates to the server. For instance, if no interesting event is taking place at a sensor, no data transmission is made to the server. When multiple events are taking place at a sensor, multiple KF_s/KF_m pairs are invoked to track the events. Significant bandwidth conservation can be achieved if a reliable and accurate data-prediction mechanism is employed, and server resources can be allocated to the sensors where actions are taking place. We have proposed the Kalman Filter as such a mechanism for its simplicity, efficiency, and provable optimality under fairly general conditions. Our preliminary results indicate that the DKF shows promise in the several scenarios with which we experimented [66]. We will evaluate the extended Kalman Filter and other models to identify the best solution(s) for given data-stream characteristics.
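To make the dual-filter protocol concrete, below is a minimal sketch (our illustration, not code from [66]): a sensor-side mirror filter predicts each incoming value and transmits an update only when the prediction misses by more than the precision constraint δ. The 1-D constant-velocity model, the class and function names, and all parameter values are assumptions made for illustration.

```python
import numpy as np

class KF1D:
    """Minimal 1-D constant-velocity Kalman filter (state: [position, velocity])."""
    def __init__(self, q=1e-3, r=1e-2):
        self.x = np.zeros(2)             # state estimate
        self.P = np.eye(2)               # state covariance
        self.F = np.array([[1., 1.],     # state transition (unit time step)
                           [0., 1.]])
        self.H = np.array([[1., 0.]])    # we observe position only
        self.Q = q * np.eye(2)           # process-noise covariance
        self.R = np.array([[r]])         # measurement-noise covariance

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[0]                 # predicted position

    def update(self, z):
        y = z - self.H @ self.x                      # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.x = self.x + (K @ y).ravel()
        self.P = (np.eye(2) - K @ self.H) @ self.P

def run_mirror_filter(stream, delta):
    """Sensor-side loop: transmit only when the mirror filter's
    prediction misses the true value by more than delta."""
    kf_m = KF1D()                        # mirror filter at the sensor
    updates = 0
    for z in stream:
        pred = kf_m.predict()            # the server's KF_s predicts this too
        if abs(pred - z) > delta:        # precision constraint violated:
            kf_m.update(z)               # incorporate z locally ...
            updates += 1                 # ... and send z to update KF_s
    return updates

rng = np.random.default_rng(0)
stream = np.cumsum(rng.normal(0.5, 0.1, 1000))   # a drifting measurement stream
print(run_mirror_filter(stream, delta=2.0), "updates out of 1000 samples")
```

Because the server runs an identical twin filter fed by the same updates, it reproduces every prediction the sensor made, so the precision constraint holds at the server without continuous transmission.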

2.1 Relationship to Current State of Knowledge

Algorithms for sensor-network resource management or stream management have received increased attention in the database community over the past two years. Major research directions include conserving computational and communication resources [3, 95], optimal storage algorithms, stream mining [130], query optimizers [10], and query solvers [64]. A comprehensive survey of the issues in data streams is presented in [10, 53]. Data streams have also been treated as time series; ideas from control theory have been borrowed for the purposes of approximation [23] and mining [130]. Table 1 presents a comparative overview of our work contrasted with three other major data-stream projects: STREAM [7, 95], AURORA [3, 113], and COUGAR [129, 26]. None of the compared approaches uses a prediction scheme, nor do they degrade gracefully when the input data are noisy. Furthermore, they cannot exploit partial information about stream-arrival characteristics (if available) to boost their performance. Our general framework, in contrast, can be applied to any streaming application by simply modifying the state-transition matrix used in the Kalman Filter formulation. We advocate the use of the Kalman Filter (KF) for stream filtering, since the KF has been well studied and widely applied to data-filtering and smoothing problems for decades. The Kalman Filter can easily be customized to handle varying stream characteristics, sensor noise, and time variance to meet the requirements specified in [87]. The same filtering framework can be adapted to address a large variety of stream-resource-management problems, providing a unified paradigm that is both powerful and versatile. Due to space limitations, we refer readers to [66] for an in-depth discussion of the rationale behind our choice of the Kalman Filter.

Figure 1: The Dual Kalman Filter (DKF) architecture. A user's precision constraint and smoothing factor enter the query processor at the central server; each server-side filter KF_s^i is paired with a mirror filter KF_m^i (precision width δ_i) at remote source i, and a smoothing KF (KF_c^i, with state-transition matrix F_i) at each source turns noisy data into smooth data before filtering. Only filtered data cross the communication network.

2.2 Preliminary Results

We have employed the DKF in our multi-camera surveillance prototype for analyzing vehicle/human behavior in a parking lot. One experiment we conducted concerns moving-vehicle tracking. In this experiment, each moving object had two attributes: location (X and Y coordinates) and velocity (speed and heading). We used a uniform random-number generator to generate different slopes of the velocity vector at random intervals of time, and generated different speeds of the object at random time intervals in a similar manner. Thus the object could randomly change its speed and heading, and then continue on that linear path for a randomly generated length of time. The maximum speed of the object was limited to 500 units, whereas the slope could change arbitrarily. Using this model, we constructed the data set shown in Figure 2(a), sampled at a fixed rate. We tested the performance of the Kalman Filter approach on two different state models:

Constant model: the system is modeled such that the latest updated value is the best prediction for the future. This model is conceptually similar to the standard cached-approximation model. The measurement consists of just the position of the object in two-dimensional space, i.e., its X and Y coordinates.

Linear model: here we take the rate of change of the position into consideration when predicting future values.
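The two state models differ only in the state vector and the state-transition matrix fed to the filter. The following sketch shows the two formulations side by side (the unit time step and the observation matrices are our illustrative assumptions):

```python
import numpy as np

# Constant model: the best prediction for the future is the latest value.
# State is just [x, y]; in spirit this matches the cached-approximation model.
F_constant = np.eye(2)
H_constant = np.eye(2)                   # observe [x, y] directly

# Linear model: the rate of change is part of the state.
# State is [x, y, vx, vy]; position is extrapolated one time step ahead.
dt = 1.0                                 # assumed unit time step
F_linear = np.array([[1, 0, dt, 0],
                     [0, 1, 0, dt],
                     [0, 0, 1,  0],
                     [0, 0, 0,  1]], dtype=float)
# Only position is measured in both models, never velocity:
H_linear = np.hstack([np.eye(2), np.zeros((2, 2))])
```

With F_constant, the filter's prediction never moves ahead of the last update, which is why its update rate matches plain caching below; F_linear extrapolates along the current velocity and can coast through linear segments without transmitting.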
Figure 2(b) shows comparative results of the two Kalman Filter models against the cached-approximation scheme. Measurements are taken in the form of position. Given a precision constraint δ, a point is updated to the server if the error in either the X or the Y value exceeds δ. In both models, only the position is recorded, not a measurement of the rate of change of the coordinate values. As is evident from Figure 2(b), the percentage of updates is the same whether we use caching or the constant model, because the constant model reduces to the caching scheme when the rate of change of values is not considered. If we use the linear model, however, utilization of the communication resource is cut down substantially at moderate precision widths. As the precision width increases, communication-resource utilization drops, and all three models show comparable performance. We also observe that the DKF performs at least as well as the caching scheme, even in a worst-case scenario.

2.3 Future Work

Many variations are possible in formulating a resource-management problem. What is unique about the Kalman Filter, as we explain in detail in [66], is that it can be customized to form workable solutions for all these formulations. The adaptation involves both simplification (e.g., the static Kalman Filter or recursive least squares) and generalization (e.g., the extended Kalman Filter).

Our future work will further tap into the strengths of the Kalman Filter. We will investigate the following issues and incorporate their solutions into our model:

- Tuning system parameters for multiple queries with multiple attributes.
- Developing solutions for adaptively adjusting the sampling rate based on the innovation sequence.
- Evaluating models for handling data streams that exhibit non-linear patterns. Candidate models include particle filters and several extended Kalman Filters.

Table 1: Summary of existing solutions and the advantages of using the Kalman Filter.

STREAM. Proposed solution: adaptive precision bounds; approximate values and dynamic precision widths cached at the server; the best estimate for the future is the last cached value; does not work for noisy data (no data smoothing). Advantage of the Kalman Filter: a prediction algorithm can reduce communication overhead even further, and on-line data smoothing helps provide query answers even for noisy data.

AURORA. Proposed solution: static precision widths; resource management using dynamic sampling rates based on loss/gain ratios; bounds do not change with the input characteristics of the stream. Advantage of the Kalman Filter: the prediction mechanism is based on input characteristics, so the output is sensitive to input values.

COUGAR. Proposed solution: partial query processing in the wireless network to prevent unnecessary data from being forwarded (load shedding); does not use any approximation or caching scheme. Advantage of the Kalman Filter: the prediction scheme gives better results, reducing load adaptively rather than dropping chunks of data indiscriminately.

Figure 2: A resource-conservation example using the Kalman Filter: (a) moving-object data set; (b) number of updates.

3 Statistical Sensor-Data Fusion

The server receives video streams from distributed cameras, each of which has limited spatial and temporal coverage, is potentially noisy, and is susceptible to occlusion and scene clutter. (To conserve bandwidth, a video camera can send just detected motion patterns, not raw video frames, to the server.) To achieve wide-area coverage, data from the cameras must be fused. Fusing spatially and temporally overlapped data is a challenging task, since cameras may have different sampling rates and resolutions, and some cameras may be mobile. We propose here a hierarchical master-slave fusion scheme. Referring to Fig. 3, at the bottom level, each slave station tracks the movements of scene objects semi-independently. The local trajectories are then relayed to a master station to be fused into a consistent, global representation. This represents a bottom-up analysis paradigm. Furthermore, as each individual camera has a limited field of view, and occlusion occurs due to scene clutter, we also employ a top-down analysis module that disseminates fused information from the master station to the slave stations. This top-down information-dissemination process assists in tracking, cross-validation, and error recovery should a camera lose track of an object.

Figure 3: Two-level Kalman Filter configuration.

3.1 Relationship to Current State of Knowledge

To construct descriptions of motion events, we must be able to track object movement. Object tracking has been extensively studied in the Computer Vision community (e.g., [6, 12, 36, 37, 49, 51, 62, 63, 75, 101, 99, 100, 108, 124, 126, 128]). For this database project, we will not develop new object-tracking algorithms.
Rather, we will enhance the reliability of existing algorithms by performing data fusion.

Sensor-data fusion refers to the task of combining multiple-sensor data in a complementary and synergistic way to improve data availability, reduce noise, and improve the robustness of the analysis. Sensor data can be fused for multiple sensors of the same or different types, and the fusion can be done at the data, feature, or decision level. Data- and feature-fusion strategies are often used for combining heterogeneous sensor data (e.g., fusing inertial, ultrasonic, and vision sensors for mobile-robotics applications [18, 31, 82, 83, 88, 94, 119]) and for fusing multi-image modalities (e.g., infrared and vision sensors) for target recognition and scene interpretation [86, 90, 68, 69, 54]. IBR (image-based rendering) techniques [29, 38, 81, 25, 85, 91, 92, 110, 112, 105] can also be considered a data-fusion strategy in which single or multiple sensors, often of the same kind, are used to construct an environment map. Decision-fusion strategies have their roots in pattern recognition [45, 50, 117]; these strategies have many well-established algorithms [14, 20, 21, 44, 46, 76, 57, 103, 104, 43] that are readily applicable to sensor-data fusion. Our unique contribution is in using two-level Kalman Filters with both bottom-up and top-down analysis for data fusion and information dissemination from and to multiple sensors, thus improving tracking reliability.

3.2 Preliminary Results

We used the Kalman Filter [22, 84] as the tool for fusing information spatially and temporally from multiple cameras to detect motion events. Suppose that a vehicle (or a person) is moving in the parking lot. Its trajectory is described in the global reference system by P. The trajectory may be observed in camera i as p_i, where 1 ≤ i ≤ n (n is the number of cameras used). The goal is then to optimally track, correlate, and fuse the individual camera trajectories into a consistent, global description.[1] We formulate the solution as a two-level hierarchy of Kalman Filters. Referring to Fig. 3, at the bottom level of the hierarchy we employ, for each camera, a Kalman Filter to estimate independently the position, velocity, and acceleration of the vehicle, based on the tracked image trajectory of the vehicle in the local camera reference frame.

[1] There are two issues that need to be addressed here: registration and correspondence. First, to fuse measurements from multiple sensors into one global estimate, two registration processes are needed: spatial registration to establish the transformation among different camera coordinate frames, and temporal registration to synchronize multiple local clocks. These techniques are well established in the literature [32, 42, 49, 55, 58, 109, 128, 131], and we have developed algorithms to accomplish both spatial and temporal registration [67]. Second, it may be difficult to synchronize the activities observed in multiple cameras; the question is how to disambiguate the correspondence of multiple trajectories. Spatial and temporal trajectory correspondence can be established through the camera-registration and stereopsis-correspondence processes, which are well-established techniques in photogrammetry and computer vision. For our discussion, we assume that these problems can be solved: we can achieve spatial and temporal registration of trajectories and disambiguate among multiple trajectories. (Interested readers are referred to our recent paper [67] for more details.)
Or, in Kalman Filter jargon, the position, velocity, and acceleration vectors establish the state of the system, while the image trajectory serves as the observation of the system state. At the top level of the hierarchy, we use a single Kalman Filter to estimate the vehicle's position, velocity, and acceleration in the global world reference frame, this time using the estimated positions, velocities, and accelerations from the multiple cameras as observations (the solid feed-upward lines in Fig. 3). This is possible because camera calibration and registration [32, 49, 58, 128, 131] are used to derive the transform matrices (the camera-to-world and world-to-camera matrices in Fig. 3). These matrices allow p_i, measured in the reference frame of an individual camera, to be related to P in the global world system.

An interesting scenario occurs when one (or more) of the cameras in the sensor network loses track of an object. This can happen because of scene clutter, self- and mutual-occlusion, or the tracked object exiting the field of view of a camera, among many other possibilities. The camera could switch from a track mode into a re-acquire mode by searching the whole image for telltale signs of the object; however, doing so inevitably slows down event processing and introduces a high degree of uncertainty into the resulting event description. Instead, we allow the dissemination of fused information to individual cameras (the dashed feed-downward lines in Fig. 3) to help guide the re-acquisition process. The Kalman Filter, being a flexible information-fusion algorithm, can readily use the fused information (instead of sensor data) for maintaining and updating state vectors. This hierarchical feed-upward (for sensor-data fusion) and feed-downward (for information dissemination) filter structure thus provides a powerful and flexible mechanism for joining sensor data spatially.

We have collected many hours of video using multiple video cameras in a parking lot. The video frames depicted both human and vehicular motion. The motion patterns for vehicles included entering, exiting, turning, backing up, circling, and zig-zag driving, among many more. For human motion, we recorded actions involving both individuals and groups, with patterns such as following, following-and-gaining, stalking, congregating, splitting, and loitering, among many others. Some of these patterns (like zig-zag driving and stalking) were acted out by our group members, while others represented behaviors commonly observed in the parking lot. Due to space limitations, we show only two sample results here. Sample results for tracking the movements of people in a parking lot are shown in Fig. 4(a) and (b). Of the three cameras we used, the views of two were partially occluded by parked cars.[2] The individual camera trajectories could therefore be broken. However, by using our two-level filter structure, we were able to fill in the gaps, smooth out sensor noise, and fuse the individual trajectories into a complete, global description.

[2] The camera positions in these figures indicate only the general directions of camera placement. The actual cameras were placed much farther away from the scene and always pointed at the parking lot.
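As an illustration of the bottom-up half of this hierarchy, the sketch below fuses per-camera world-frame estimates by weighting each with its inverse covariance, a simple single-step stand-in for the master filter's observation update. The function name, covariance values, and static formulation are our assumptions, not the system's actual code.

```python
import numpy as np

def fuse_camera_estimates(estimates):
    """Combine per-camera state estimates (already transformed into the
    world frame) into one global estimate, weighting each estimate by
    its inverse covariance (the standard information-filter fusion).

    estimates: list of (x_i, P_i) pairs, where x_i is a camera's
    world-frame estimate and P_i its covariance from the slave filter.
    """
    info = sum(np.linalg.inv(P) for _, P in estimates)       # total information
    mean = sum(np.linalg.inv(P) @ x for x, P in estimates)
    P_global = np.linalg.inv(info)
    return P_global @ mean, P_global

# Example: two cameras observing the same vehicle position; camera 2 is
# partially occluded, hence its larger covariance and smaller weight.
x1, P1 = np.array([10.0, 5.0]), np.diag([0.5, 0.5])
x2, P2 = np.array([10.6, 4.7]), np.diag([4.0, 4.0])
x_global, P_global = fuse_camera_estimates([(x1, P1), (x2, P2)])
print(x_global)   # dominated by the better-tracked camera 1
```

In the full system, the master Kalman Filter ingests such estimates as observations over time, and the fused state can be disseminated back down to a camera in re-acquire mode.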

Figure 4: (a) A simulated stalking behavior in a parking lot and (b) trajectories of the sample stalking behavior; (c) and (d) show similar data-fusion results for vehicular motion. In these figures, the solid line is the fused trajectory; '.' marks the tracked trajectory from camera 1, 'x' the trajectory from camera 2, and 'o' the trajectory from camera 3.

Fig. 4(c) and (d) show the analysis of a vehicle's driving pattern when two cameras were used. Note that even with a very small overlap in the fields of view of the two cameras, and with a circling motion covering a large spatial area (hence each camera observed only part of the motion trajectory), we were able to fuse the individual camera trajectories to arrive at a complete description.

3.3 Future Work

We plan to investigate the following issues and incorporate their solutions into our existing model:

- Modifying the hierarchical Kalman-Filter model to perform spatial joins for mobile sources.
- Improving the robustness of the Kalman Filter by allowing the tracking of multiple state estimates per object. While the Kalman Filter is a simple and powerful mechanism for state estimation, its validity is questionable if the assumptions on the prior and the noise do not hold. Furthermore, there are situations where multiple hypotheses must be kept until a later time, when more visual evidence has been gathered to validate some and discredit others. For example, if two or more persons enter the field of view of a camera in such a way that their silhouettes overlap significantly, or one completely occludes another, it can be difficult for the tracking algorithm to discern whether such a moving region corresponds to a single person or to several. The single-person hypothesis might be kept until it can be safely discredited. We will address this problem using hypothesis-and-verification paradigms through sampling [75, 97, 62, 63].

4 Sequence-data to Event Mapping

To interpret video-sensor data, we need to map sequence data (motion trajectories) to events. A sequence datum is defined as an ordered set of items S = (s_1, s_2, ..., s_m); the items are logically contiguous, and each item denotes a set of attributes that varies across applications. Given a set of sequences X that can be partitioned into a labeled subset L and an unlabeled subset U, the task of sequence-data learning is to learn a discriminative function f from L using a learning algorithm; then, using f, we can predict the labels of the unlabeled sequences in U. To conduct supervised learning with a small number of training instances, the discriminant approach has been shown to be much more effective [65] than the generative approach (as in HMMs). In particular, SVMs require only the boundary instances (support vectors) to participate in a class prediction, and hence require a much smaller amount of training data than other methods.

Unfortunately, traditional kernel functions (such as polynomial and RBF functions) that have been employed with SVMs assume a feature space of fixed dimension. They cannot be applied to sequence data, which are variable-length in nature. We will design kernel functions that can effectively handle variable-length sequence data. To conduct supervised learning, we first need to extract useful information (features) from sequence data to form representations [132]. Although many representations have been proposed in the past (see Section 4.1.1 for a detailed discussion), we believe that the best representation should be event-dependent. Therefore, our approach is first to extract multi-resolution descriptors from sequence data, and then to rely on the algorithms that we subsequently develop to learn the best descriptor combination for a target semantic. For instance, a motion pattern can be depicted as a sequence of symbolic strings at the coarse level, while detailed information such as velocity and acceleration is recorded at the refined levels. If an event concerns only the turning pattern of a vehicle, then the coarse-level symbolic representation may be adequate; otherwise, proper secondary structures should be used. To support multi-resolution learning, we will 1) design kernel functions to characterize similarity at individual resolution levels, and 2) research kernel-fusion mechanisms to integrate kernels at multiple levels. For both individual kernel design and kernel fusion, we will prove the kernels to be mathematically valid and verify them to be effective.

4.1 Relationship to Current State of Knowledge

Sequence-data learning involves two major subtasks: description of sequence data, and development of a sequence-data learning algorithm. Related work in these two areas is summarized as follows.

4.1.1 Sequence-data Representation

Many sequence-data descriptions have been introduced in the past; Figure 5 summarizes the more popular ones in the literature. Traditional descriptions can be roughly divided into numeric-valued and symbolic-valued types. Numeric-based descriptions represent raw sequences as sequences of transformed numeric values. For example, the Discrete Fourier Transform (DFT) [48] uses the Fourier transform to represent the original sequences; the Discrete Wavelet Transform (DWT) [60] applies the wavelet transform as the sequence representation; Singular Value Decomposition (SVD) [77] uses eigenvalues to provide information about the underlying structure of the data; piecewise linear approximation (PLA) [96] approximates each subsequence by a linear function, concatenating the coefficients of the linear functions into a new sequence representation; and piecewise aggregate approximation (PAA) [70] segments each sequence into a fixed number of subsequences and concatenates the mean values of the subsequences into a data-reduced representation. Symbolic representations such as natural language [56] and strings [28] are treated as a simplified transformation of the original data that retains much of the important temporal information. Although a symbolic representation does not retain much numeric detail, it enjoys the advantage of improved computational efficiency; the analysis of symbolic data is also often less sensitive to measurement noise [41].

Figure 5: Sequence-representation literature: numeric-valued representations (Discrete Fourier Transform, Discrete Wavelet Transform, Singular Value Decomposition, piecewise linear approximation, piecewise aggregate approximation) and symbolic-valued representations (natural language, strings).
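As a small illustration of two of the representations above, the sketch below computes a PAA descriptor and a coarse symbolic string from a toy trajectory. The quantile-based discretization and all names are our assumptions, not the exact constructions of [70] or [28].

```python
import numpy as np

def paa(seq, n_segments):
    """Piecewise aggregate approximation: split the sequence into
    n_segments roughly equal pieces and keep each piece's mean."""
    pieces = np.array_split(np.asarray(seq, dtype=float), n_segments)
    return np.array([p.mean() for p in pieces])

def symbolize(seq, alphabet="abcd"):
    """A coarse symbolic view: z-normalize, then bin each value into
    equiprobable regions (a quantile-based discretization, for illustration)."""
    z = (seq - seq.mean()) / seq.std()
    cuts = np.quantile(z, np.linspace(0, 1, len(alphabet) + 1)[1:-1])
    return "".join(alphabet[np.searchsorted(cuts, v)] for v in z)

traj = np.sin(np.linspace(0, 4 * np.pi, 200)) + 0.1 * np.random.randn(200)
coarse = paa(traj, 8)                 # numeric, data-reduced representation
print(coarse.round(2), symbolize(coarse))
```

The PAA vector retains coarse numeric shape, while the symbolic string discards magnitude detail for robustness and efficiency, matching the trade-off described above.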
For a specific application, choosing the best representation is a challenging research problem. Keogh's work reports a comprehensive series of experiments showing that no sequence-data representation works best for all kinds of datasets [71]; the best representation is actually application-dependent. The goal of this research task is to find the best representation (the best combination of descriptors) in an application-dependent and query-dependent manner.

4.1.2 Sequence-data Learning

Being able to measure the similarity between data instances accurately is fundamental to learning. Traditional sequence-data similarity measurements include Minkowski metrics [77] and non-Minkowski metrics [74]. More sophisticated sequence-distance measures, such as dynamic time warping (DTW) [9, 72], piecewise normalization [61], suffix trees [59], edit distance [19], and cosine wavelets [60], have been investigated for sequences of variable length. Another popular method for modeling and predicting temporal sequences is based on HMMs [98, 16]. HMMs model sequential dependencies by treating the sequence as a Markov chain, and have been shown to be effective in applications such as speech recognition. Despite their success, HMMs may require a significant amount of training data when the model size is large.

4.2 Preliminary Results

The kernel-design task is to find a valid and meaningful kernel for sequence data in two steps: the first step is to design a kernel for each sequence descriptor, and the second is to fuse the multiple kernels in an optimal way.

4.2.1 Individual Kernel Design

In this thread, we design new kernels for sequence-data learning. SVMs [116] are the most popular kernel-based methods, but SVMs can be applied only to training data that reside in a vector space.

The basic form of an SVM classifying an input datum x is expressed as

f(x) = \mathrm{sign}\Big( \sum_i \alpha_i y_i K(x_i, x) + b \Big),   (1)

where Φ is a mapping function that maps input vectors into the feature space; ⟨·, ·⟩ denotes the inner-product operator; x_i is the i-th training sample; y_i is its class label; and α_i is its Lagrange multiplier. The kernel function is K(x_i, x) = ⟨Φ(x_i), Φ(x)⟩, and b is the bias. For sequence data, in particular variable-length sequences, we lack the basis function Φ for mapping sequences of various lengths to spaces of different dimensions. Fortunately, the embedding of a finite set of points is entirely specified by writing a finite-dimensional kernel matrix. Put another way, as long as we have a positive definite kernel matrix that characterizes the sequence-data similarity, we can use kernel methods [80]. Hence, the design task is reduced to formulating a kernel matrix satisfying two requirements: a semantic requirement and a mathematical requirement. Regarding the semantic requirement, the kernel matrix must capture the similarity in local and global structure between the sequence data. As to the mathematical requirement, a valid kernel matrix must be symmetric and positive semi-definite [106] to ensure that the projected feature space exists.

A natural way to define the similarity between two sequences is via pair-wise string-alignment scores [93, 111]. Two sequences of variable lengths can be aligned by matching symbols at corresponding positions and inserting gap symbols ('-') at unaligned positions. An alignment is a mutual arrangement of two sequences, showing where the two sequences are similar and where they differ; the better aligned two sequences are, the more similar they are. By performing alignment on the given sequences, we can build a matrix S in which S_ij is the pairwise similarity between sequences i and j. However, there is one potential problem with the matrix S: though S is symmetric, it might not be positive semi-definite. To remedy the problem, we propose to also consider transitive similarity when measuring pair-wise similarity. To motivate our approach, Figure 6 provides an example of transitive similarity between data, with each node denoting one data instance in the 2-D space. Assume P1, P2, and P3 form an equilateral triangle, meaning that the distances between them are all the same. However, if we take the data distribution into consideration, we notice that more data are located between P1 and P2, or between P2 and P3, than between P1 and P3. More likely, P1 and P3 belong to the same class, linked transitively through P2, whereas a point without such intermediate support would be an outlier. Therefore, we need to define a kernel matrix that considers both pair-wise similarity and transitive similarity. Intuitively, a transitive relationship helps characterize the similarity between data more accurately by taking the data distribution into account. Furthermore, we have proved the following two propositions, which show that when a given similarity is symmetric, taking the transitive relationship into consideration results in a legal kernel.

Figure 6: Example of transitive similarity (points P1, P2, and P3).

Proposition 4.1 Denote S_ij as the similarity between sequences i and j obtained from pair-wise string-alignment scores. If a matrix K is defined as

K = e^{S} = \sum_{k=0}^{\infty} \frac{S^k}{k!},   (2)

then K is a semantically valid kernel, reflecting the similarity relationships between sequences, including transitive similarity.

Proposition 4.2 K = e^{S} is a mathematically valid kernel: it is symmetric and positive semi-definite.
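Reading Eq. (2) as the matrix exponential of the alignment-score matrix S, the sketch below computes K = e^{βS} via the eigendecomposition of the symmetric S and checks positive definiteness on a toy example; the β parameter and the score values are illustrative assumptions.

```python
import numpy as np

def exponential_kernel(S, beta=1.0):
    """K = exp(beta*S): matrix exponential of a symmetric similarity
    matrix S. Since S^k chains k-step similarities, K aggregates both
    direct and transitive similarity; its eigenvalues exp(beta*w_i)
    are all positive, so K is positive definite."""
    S = (S + S.T) / 2.0                  # enforce symmetry
    w, V = np.linalg.eigh(S)             # S = V diag(w) V^T
    return (V * np.exp(beta * w)) @ V.T  # V diag(exp(beta*w)) V^T

S = np.array([[1.0, 0.8, 0.1],           # toy pairwise alignment scores:
              [0.8, 1.0, 0.8],           # s1~s2 and s2~s3 align well,
              [0.1, 0.8, 1.0]])          # s1~s3 aligns poorly
K = exponential_kernel(S)
print(np.linalg.eigvalsh(K).min() > 0)   # True: positive definite
print(K.round(2))                        # K[0,2] is lifted by the s1-s2-s3 path
```

Note how the entry for the poorly aligned pair (s1, s3) is raised by the chain through s2, which is exactly the transitive-similarity effect motivated by Figure 6.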
4.2.2 Kernel Fusion

After formulating individual kernels, the next step is to fuse them. Each individual kernel extracts a specific type of information from the given data, thereby providing a partial view of the data; kernel fusion forms a complete picture of the relationships between the different components of the original sequence data. Assume R is a relation that decomposes an instance x into a D-tuple of parts (x^1, ..., x^D), and let kernel K_d measure the similarity between the d-th parts. In different contexts, not all levels' descriptors should be considered equally important; kernel fusion provides the flexibility to learn which level should matter more for the target learning task. Possible fusion rules are the weighted sum and the tensor product, since kernels are provably closed under sums and products. The weighted sum is formulated as

K(x, z) = \sum_{d=1}^{D} \mu_d K_d(x^d, z^d),   (3)

and the tensor-product formulation is

K(x, z) = \prod_{d=1}^{D} K_d(x^d, z^d).   (4)
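A minimal sketch of the two fusion rules of Eqs. (3) and (4), operating on Gram matrices computed at two resolution levels; the matrices, weights, and function names are our illustrative assumptions (the product rule is applied element-wise on the Gram matrices, i.e., as a Schur product).

```python
import numpy as np

def fuse_weighted_sum(kernels, weights):
    """Eq. (3): K = sum_d mu_d * K_d, a valid kernel for mu_d >= 0."""
    return sum(m * K for m, K in zip(weights, kernels))

def fuse_product(kernels):
    """Eq. (4): K = prod_d K_d, taken element-wise on the Gram
    matrices; kernels are closed under products (Schur product)."""
    out = np.ones_like(kernels[0])
    for K in kernels:
        out = out * K
    return out

# Gram matrices from two resolution levels (e.g., a coarse symbolic
# kernel and a fine velocity/acceleration kernel) over the same data:
K_coarse = np.array([[1.0, 0.9], [0.9, 1.0]])
K_fine = np.array([[1.0, 0.2], [0.2, 1.0]])
print(fuse_weighted_sum([K_coarse, K_fine], [0.7, 0.3]))
print(fuse_product([K_coarse, K_fine]))
```

In our current practice, the weights μ_d would be chosen by cross-validation, which is the expense that motivates the convex-optimization formulation discussed next.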

4.3 Future Work

We have proved that our kernel-design and kernel-fusion algorithms both produce mathematically valid kernels, and the excitement has just begun. Regarding kernel design, our immediate plan is to verify the effectiveness of the exponential kernel through extensive empirical studies on several datasets, including UCI time-series data, video-surveillance motion trajectories, and RNA sequences. Beyond the exponential kernel, there are a couple of promising candidate kernels that we plan to investigate. Regarding kernel fusion, we currently find the best fusion model (e.g., the weights of the individual kernels) through cross-validation, which can be time-consuming; we plan to investigate whether the fusion problem can be formulated and solved as a convex optimization problem. In addition, we have started investigating a nonlinear kernel-fusion scheme in [125, 123], and the results are promising.

5 Context-based Distance Function Formulation

Interpreting a video event is a context-dependent task. For instance, a vehicle circling an empty parking lot at night is suspicious, whereas the same pattern taking place in the daytime, in a full or an empty parking lot, may be benign. A stalking pattern in a parking lot may raise a security concern, but the same pattern in a grocery store can be incidental and harmless. Thus, video-event recognition must consider contextual information. Context-based information access was identified as a key future database-research area at a recent National Science Foundation workshop [2]. At the heart of context-based information access is the formulation of an application- and user-dependent distance function for measuring data similarity. An accurate measurement of similarity based on contextual information is essential for personalizing many database tasks, such as clustering, indexing, and retrieval [8, 13, 73]; the quality of the distance function significantly affects the success of finding meaningful results [4, 5, 17, 47, 52, 78]. To date, the most widely used distance metric is perhaps the Euclidean distance, because of its intuitive nature and ease of computation. In the last two decades, much work has been devoted to transforming the Euclidean distance (or, more generally, the L_p-norm) by weighting the features based on their importance for a target task [5, 47]. Weighting the features is equivalent to performing a linear transformation on the space formed by the features. Linear models enjoy the twin advantages of simplicity of description and efficiency of computation, but this same simplicity is insufficient to model similarity for many real-world datasets. For example, it has been shown in image/video retrieval that a query concept is typically a nonlinear combination of perceptual features [102, 114]. Linear models can be too restrictive for mapping features to semantics, and hence unsuitable. To support flexible and effective context-based distance-function formulation, we will research a nonlinear feature-transformation procedure called kernel alignment. At first it might seem that a nonlinear transformation would suffer from high model and computational complexity. Our kernel-alignment procedure avoids these problems by employing the kernel trick, which lets us generalize distance-based algorithms to operate in the feature space (defined next), usually nonlinearly related to the input space. The input space (denoted I) is the original space in which the data vectors are located; the feature space (denoted F) is the space to which the data vectors are projected, linearly or nonlinearly. The advantage of using the kernel trick is that, instead of explicitly determining the coordinates of the data vectors in the feature space, the distance computation in F can be performed efficiently in I through a kernel function.
Specifically, given two vectors x and y, the kernel function K(x, y) is defined as the inner product of Φ(x) and Φ(y), where Φ is a basis function that maps x and y from I to F. The inner product between two vectors can be thought of as a measure of their similarity; therefore, K(x, y) returns the similarity of x and y in F. The distance between x and y in terms of the kernel is defined as

d(x, y)^2 = \|\Phi(x) - \Phi(y)\|^2 = K(x, x) - 2K(x, y) + K(y, y).

Since a kernel function can be either linear or nonlinear, the traditional feature-weighting approach (e.g., [5]) is just a special case of kernel alignment.
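For instance, with an RBF kernel the feature space is infinite-dimensional, yet the kernel distance above is computed entirely in the input space. A minimal sketch (γ and the sample vectors are our assumptions):

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    """RBF kernel: similarity of x and y in F without ever computing Phi."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_distance_sq(x, y, k=rbf):
    """Squared feature-space distance via the kernel trick:
    ||Phi(x) - Phi(y)||^2 = K(x,x) - 2 K(x,y) + K(y,y)."""
    return k(x, x) - 2 * k(x, y) + k(y, y)

x, y = np.array([0.0, 0.0]), np.array([1.0, 2.0])
print(kernel_distance_sq(x, y))   # computed entirely in the input space I
```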

What is a good distance function, and how do we use contextual information in formulating such a function to interpret video events? We aim to answer these questions in two thrusts, theory and algorithm.

Theory thrust. We will derive theories showing how to optimally transform a given distance function by using contextual information. We are particularly interested in designing transformation methods that are both efficient and flexible in modeling complex distance functions.

Algorithm thrust. We will develop models and algorithms to effectively collect contextual information, and then use that information to transform distance functions.

5.1 Relationship to Current State of Knowledge

Distance-function learning approaches can be broadly divided into metric-learning and kernel-learning approaches. In the rest of this section we discuss representative work in each.

5.1.1 Metric Learning

Metric learning attempts to find the optimal linear transformation of a given set of data vectors to better characterize the similarity between them. The transformation itself is linear, but the data vectors may first be mapped to a new set of vectors using a nonlinear function. The transformation of the data vectors is equivalent to assigning weights to the features of the vectors; therefore, metric learning is often called feature weighting. The metric-learning approach is given a set of data vectors {x_i} and similarity information in the form of a similar set S, where (x_i, x_j) ∈ S if x_i and x_j are similar. Metric learning aims to learn a distance metric d(x_i, x_j) between data vectors that respects the similarity information. Mathematically, the distance metric can be represented as

d(x, y) = \sqrt{(\phi(x) - \phi(y))^\top A \, (\phi(x) - \phi(y))},   (5)

where A needs to be positive (semi-)definite to satisfy the metric properties of non-negativity and the triangle inequality. The choice of the basis function φ and the scaling matrix A differentiates the various metric-learning algorithms. Wettschereck et al. [120] review the performance of feature-weighting algorithms, with emphasis on their performance for the k-nearest-neighbor classifier. Here we discuss only a few representative algorithms (for the others, please refer to [120]).

A number of papers in the current literature address the problem of learning distance metrics using side information in the form of groups of similar vectors [11, 127]. Side information can be user-provided information on the similarity characteristics of a subset of the data. The work of [11] uses Relevant Component Analysis (RCA) to efficiently learn a full-rank Mahalanobis metric [89]. The authors use equivalence relations as the side information: they compute the pooled within-group covariance

C = \frac{1}{n} \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ji} - m_j)(x_{ji} - m_j)^\top,

where m_j is the mean of the j-th group of vectors, and k and n_j denote the number of groups and the number of samples in the j-th group, respectively. The matrix C^{-1/2} is used for the transformation, and the inverse of C serves as the Mahalanobis matrix. Xing et al. [127] treat the same problem as a convex optimization problem, thus producing local-optima-free algorithms; they present techniques for learning the weighting matrix in both the diagonal and the full-matrix cases. The major difference between the two approaches is that RCA uses closed-form expressions, whereas [127] uses iterative methods that can be sensitive to parameter tuning and are computationally expensive. C. Aggarwal [5] discusses a systematic framework for designing distance functions sensitive to particular characteristics of the data. The models used are the parametric Minkowski model and the parametric cosine model, both of which attempt to minimize the error with respect to the training data. The parametric Minkowski model can be thought of as feature weighting in I, and the parametric cosine model as an inner product in I.

In summary, metric learning aims to learn a good distance function by computing optimal feature weightings in the input space. Clearly, this linear transformation is restrictive in terms of modeling complex semantics. Although one can perform a nonlinear transformation of the features via a basis function in I, the resulting computational complexity renders this approach impractical. The kernel-learning approach, which we discuss next, successfully addresses this concern about computational complexity.
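A small sketch of Eq. (5) with φ the identity, together with an RCA-style construction of A from side information (the pooled within-group covariance, inverted); the toy groups and all names are our assumptions, not the exact algorithm of [11].

```python
import numpy as np

def mahalanobis(x, y, A):
    """Eq. (5) with phi = identity: d(x, y) = sqrt((x-y)^T A (x-y)).
    A must be positive (semi-)definite for d to be a metric; a diagonal
    A is plain feature weighting, a full A also rotates the space."""
    d = x - y
    return np.sqrt(d @ A @ d)

# RCA-style sketch: pool the within-group covariance of user-provided
# groups of similar vectors, and use its inverse as A, so directions
# that vary within a group count less toward the distance.
rng = np.random.default_rng(1)
groups = [rng.normal(size=(20, 2)) * [1.0, 0.2],          # toy group 1
          rng.normal(size=(30, 2)) * [1.0, 0.2] + 3.0]    # toy group 2
C = sum(np.cov(g.T, bias=True) * len(g) for g in groups) / sum(len(g) for g in groups)
A = np.linalg.inv(C)
print(mahalanobis(np.zeros(2), np.ones(2), A))
```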
5.1.2 Kernel Learning

Kernel-based methods implicitly map a set of data vectors in I to some high-dimensional (possibly infinite-dimensional) feature space F, using a (usually nonlinear) basis function Φ: I → F. The kernel K is defined as an inner product between two basis functions in F: K(x, y) = ⟨Φ(x), Φ(y)⟩. Kernel-based methods use these inner products as a similarity measure (theoretical justifications are presented in [107]). The kernel provides an elegant way of dealing with nonlinear algorithms by reducing them to linear ones in F [107]: any linear algorithm that can be carried out in terms of inner products can be made nonlinear by substituting an a-priori kernel. A typical example is Support Vector Machines [24]. The requirement for choosing a valid K is that it be symmetric and positive (semi-)definite. The distance between x and y in F can then be computed via the kernel trick:

d(x, y)^2 = \|\Phi(x) - \Phi(y)\|^2 = K(x, x) - 2K(x, y) + K(y, y).

An important advantage of using kernels stems from the ease of computing this inner product (similarity measure) in F without actually having to know Φ, as long as the chosen kernel is a positive (semi-)definite function. Much work [15, 24, 107] has been done on classification, clustering, and regression methods using indirect computation of the kernel K, and hence of the distance d. Because of the kernel's central role, a poor kernel function can lead to significantly poor performance. Cristianini et al. [40] introduced the notion of kernel alignment to measure the similarity between two kernels, or between a given kernel and a target function. Geometrically, given a set of training instances, the alignment score is defined as the cosine of the angle between the two kernel matrices,[3] after flattening the two matrices into vectors. They also proposed the notion of the ideal kernel (K*), toward which any given kernel is supposed to be aligned (we discuss this further in Section 5.2). Based on the ideas of kernel alignment and the ideal kernel, a couple of researchers have recently begun to develop processes for learning a kernel directly from the training data, instead of choosing an a-priori kernel.

[3] Given a kernel function K and a set of instances X, the kernel matrix (Gram matrix) is the matrix of all possible inner products of pairs from X: G_ij = K(x_i, x_j).
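The alignment score of [40] reduces to a few lines once the Gram matrices are in hand; below is a sketch computing the alignment between a prior RBF Gram matrix and the ideal kernel of a toy labeled set (the sample data are our assumptions).

```python
import numpy as np

def alignment(K1, K2):
    """Empirical kernel alignment [40]: cosine of the angle between two
    Gram matrices flattened into vectors, <K1,K2>_F / (||K1||_F ||K2||_F)."""
    num = np.sum(K1 * K2)                        # Frobenius inner product
    return num / (np.linalg.norm(K1) * np.linalg.norm(K2))

X = np.array([[0.0], [0.2], [1.0], [1.2]])       # toy 1-D instances
y = np.array([1, 1, -1, -1])                     # class labels
K_prior = np.exp(-np.square(X - X.T))            # RBF Gram matrix
K_ideal = (np.outer(y, y) > 0).astype(float)     # 1 iff same class (Eq. 7)
print(alignment(K_prior, K_ideal))               # how well K fits the labels
```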

Cristianini et al. [40] proposed a transductive-learning [118] method that learns the kernel matrix by optimizing the eigenvalue coefficients in the spectral decomposition of the full kernel matrix over both training and test data. Crammer et al. [39] argued that alignment may not be a good measure when the kernel magnitude is large; they therefore formulated kernel-matrix learning as a boosting process [34], so that a good kernel matrix can be constructed by weighting simple base kernels obtained by solving the generalized eigenvector problem. Both methods are transductive and may not generalize to unseen instances, unless the algorithm is repeated with the new instances incorporated into the training procedure (a rather inefficient way to classify an unseen data instance). In summary, though these recent methods suggest interesting ways to modify a given kernel function, some important questions remain to be answered, such as: Is the ideal kernel really ideal? Is the modification optimal? Is the resulting kernel a valid one? We have recently developed a kernel-alignment algorithm that performs a linear transformation on the prior kernel matrix, and we have theoretically proven that our method leads to a valid distance function. We report our preliminary results and discuss future work in the next two sections.

5.2 Preliminary Results

Let us consider a two-class classification problem with a training dataset {(x_i, y_i)}, where y_i ∈ {+1, −1}. We can consider K(x_i, x_j) as a similarity measure between instances x_i and x_j. For instance, when an RBF function is employed, the value of K(x_i, x_j) ranges from 0 to 1: it approaches 0 when x_i and x_j are infinitely far apart (dissimilar) in the input space, and 1 when they are infinitely close (similar). The parameters associated with a kernel determine the pairwise similarity measures; thus, the choice of a good kernel and its parameters is equivalent to the choice of a good distance function for measuring similarity. When we have perfect knowledge of the class membership y_i of each instance, we can write the ideal similarity matrix generated by the ideal kernel K* of [40] as

K*(x_i, x_j) = 1 if y_i = y_j, and 0 otherwise.   (7)

An ideal kernel can be constructed on a training dataset where the class labels are known. Unfortunately, the ideal kernel overfits the training data, so it cannot propagate prior knowledge to unseen data. The work of [40] proposes to measure the alignment of a kernel K with the ideal kernel using their Frobenius product ⟨K, K*⟩_F: kernel alignment uses the alignment score with the ideal kernel K* to indicate the degree to which the kernel fits the training data. However, since the ideal kernel itself is a trivial kernel, aligning a function toward it ([79]) may not lead to improved results.

We have devised a kernel-alignment algorithm that performs a linear transformation on a prior kernel matrix. We have theoretically proven that our method leads to a valid distance function. More importantly, we have also shown that, since the optimization is performed in a convex space, the solution obtained is globally optimal; consequently, given a prior kernel and side information, the alignment needs to be performed just once. Our empirical results show that our proposed method outperforms competing methods on a variety of testbeds [121]. More specifically, we use a linear transformation model in F, not in I, to idealize K. The kernel matrix is then modified as follows (where γ_d ≥ γ_s ≥ 0):

\tilde{K}_{ij} = K_{ij} + \gamma_s if y_i = y_j, and \tilde{K}_{ij} = K_{ij} - \gamma_d if y_i \neq y_j.   (8)
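Under our reconstruction of Eq. (8) above, the modification is a per-entry shift of the Gram matrix. The sketch below applies it to a toy two-class set and checks that the alignment to the ideal kernel improves; the γ values, data, and names are illustrative assumptions, not the tuned settings of [121].

```python
import numpy as np

def idealize(K, y, gamma_s=0.1, gamma_d=0.2):
    """Eq. (8) in spirit: nudge the prior Gram matrix toward the ideal
    kernel -- raise entries of same-class pairs by gamma_s and lower
    those of different-class pairs by gamma_d (gamma_d >= gamma_s, per
    the emphasis on separating dissimilar pairs). Values illustrative."""
    same = np.outer(y, y) > 0
    return np.where(same, K + gamma_s, K - gamma_d)

X = np.array([[0.0], [0.3], [1.0], [1.3]])
y = np.array([1, 1, -1, -1])
K = np.exp(-np.square(X - X.T))                  # prior RBF Gram matrix
K_tilde = idealize(K, y)
K_ideal = (np.outer(y, y) > 0).astype(float)     # ideal kernel of Eq. (7)
align = lambda A, B: np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(align(K, K_ideal), "->", align(K_tilde, K_ideal))   # alignment improves
```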
In what follows, we present two important propositions for which we have provided proofs. Proposition 5.1 demonstrates that, under some constraints on γ_s and γ_d, our proposed idealized kernel in Eq. 8 is a valid kernel. Proposition 5.2 mathematically demonstrates that, under the constraints from Proposition 5.1, the idealized kernel of Eq. 8 guarantees a better alignment to the ideal kernel K* than the prior kernel K achieves. In both propositions, we assume that γ_d ≥ γ_s: this constraint means that we place more emphasis on decreasing the kernel value of dissimilar instance-pairs, in line with the spirit of achieving maximum distance between dissimilar pairs to keep the separating margin large.

Proposition 5.1 Under the assumption that γ_d ≥ γ_s ≥ 0, the idealized kernel K̃ is positive definite if the prior kernel K is positive definite.

Proposition 5.2 The kernel matrix of the idealized kernel K̃ obtains a better alignment to the ideal kernel matrix K* than the prior kernel matrix K does, under the constraints of Proposition 5.1. Moreover, a smaller γ_s or γ_d would induce a higher alignment score.

5.3 Future Work

We plan to further address the following three research issues:

Ideal kernel. We are not satisfied with the ideal kernel proposed by Cristianini et al. [40], and will investigate what a better ideal kernel might entail. Our preliminary conjecture is that an ideal kernel needs to take the data distribution into consideration in order to avoid overfitting. We believe that the exponential kernel (discussed in Section 4), which has strong links to graph theory, can potentially provide a data-dependent ideal kernel. We will identify other candidates and conduct extensive experiments to validate them.


More information

Multiple-Choice Questionnaire Group C

Multiple-Choice Questionnaire Group C Family name: Vision and Machine-Learning Given name: 1/28/2011 Multiple-Choice naire Group C No documents authorized. There can be several right answers to a question. Marking-scheme: 2 points if all right

More information

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

The Comparative Study of Machine Learning Algorithms in Text Data Classification* The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification

More information

Fusion of Radar and EO-sensors for Surveillance

Fusion of Radar and EO-sensors for Surveillance of Radar and EO-sensors for Surveillance L.J.H.M. Kester, A. Theil TNO Physics and Electronics Laboratory P.O. Box 96864, 2509 JG The Hague, The Netherlands kester@fel.tno.nl, theil@fel.tno.nl Abstract

More information

Dimension Reduction CS534

Dimension Reduction CS534 Dimension Reduction CS534 Why dimension reduction? High dimensionality large number of features E.g., documents represented by thousands of words, millions of bigrams Images represented by thousands of

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

An Introduction to Content Based Image Retrieval

An Introduction to Content Based Image Retrieval CHAPTER -1 An Introduction to Content Based Image Retrieval 1.1 Introduction With the advancement in internet and multimedia technologies, a huge amount of multimedia data in the form of audio, video and

More information

Eye Detection by Haar wavelets and cascaded Support Vector Machine

Eye Detection by Haar wavelets and cascaded Support Vector Machine Eye Detection by Haar wavelets and cascaded Support Vector Machine Vishal Agrawal B.Tech 4th Year Guide: Simant Dubey / Amitabha Mukherjee Dept of Computer Science and Engineering IIT Kanpur - 208 016

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis

A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis Meshal Shutaywi and Nezamoddin N. Kachouie Department of Mathematical Sciences, Florida Institute of Technology Abstract

More information

Toward Building a Robust and Intelligent Video Surveillance System: A Case Study Edward Chang and Yuan-Fang Wang

Toward Building a Robust and Intelligent Video Surveillance System: A Case Study Edward Chang and Yuan-Fang Wang Toward Building a Robust and Intelligent Video Surveillance System: A Case Study Edward Chang and Yuan-Fang Wang Douglas R. Lanman CS 295-1: Sensor Data Management 28 Sept. 2005 1 Outline Introduction

More information

Data mining with sparse grids

Data mining with sparse grids Data mining with sparse grids Jochen Garcke and Michael Griebel Institut für Angewandte Mathematik Universität Bonn Data mining with sparse grids p.1/40 Overview What is Data mining? Regularization networks

More information

Unsupervised learning in Vision

Unsupervised learning in Vision Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual

More information

Energy Conservation of Sensor Nodes using LMS based Prediction Model

Energy Conservation of Sensor Nodes using LMS based Prediction Model Energy Conservation of Sensor odes using LMS based Prediction Model Anagha Rajput 1, Vinoth Babu 2 1, 2 VIT University, Tamilnadu Abstract: Energy conservation is one of the most concentrated research

More information

Prof. Fanny Ficuciello Robotics for Bioengineering Visual Servoing

Prof. Fanny Ficuciello Robotics for Bioengineering Visual Servoing Visual servoing vision allows a robotic system to obtain geometrical and qualitative information on the surrounding environment high level control motion planning (look-and-move visual grasping) low level

More information

Exploring Curve Fitting for Fingers in Egocentric Images

Exploring Curve Fitting for Fingers in Egocentric Images Exploring Curve Fitting for Fingers in Egocentric Images Akanksha Saran Robotics Institute, Carnegie Mellon University 16-811: Math Fundamentals for Robotics Final Project Report Email: asaran@andrew.cmu.edu

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Combine the PA Algorithm with a Proximal Classifier

Combine the PA Algorithm with a Proximal Classifier Combine the Passive and Aggressive Algorithm with a Proximal Classifier Yuh-Jye Lee Joint work with Y.-C. Tseng Dept. of Computer Science & Information Engineering TaiwanTech. Dept. of Statistics@NCKU

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

CS 229 Midterm Review

CS 229 Midterm Review CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask

More information

Robust PDF Table Locator

Robust PDF Table Locator Robust PDF Table Locator December 17, 2016 1 Introduction Data scientists rely on an abundance of tabular data stored in easy-to-machine-read formats like.csv files. Unfortunately, most government records

More information

CS 521 Data Mining Techniques Instructor: Abdullah Mueen

CS 521 Data Mining Techniques Instructor: Abdullah Mueen CS 521 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 2: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

Random Forest A. Fornaser

Random Forest A. Fornaser Random Forest A. Fornaser alberto.fornaser@unitn.it Sources Lecture 15: decision trees, information theory and random forests, Dr. Richard E. Turner Trees and Random Forests, Adele Cutler, Utah State University

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

Locally Weighted Learning for Control. Alexander Skoglund Machine Learning Course AASS, June 2005

Locally Weighted Learning for Control. Alexander Skoglund Machine Learning Course AASS, June 2005 Locally Weighted Learning for Control Alexander Skoglund Machine Learning Course AASS, June 2005 Outline Locally Weighted Learning, Christopher G. Atkeson et. al. in Artificial Intelligence Review, 11:11-73,1997

More information

Machine Learning / Jan 27, 2010

Machine Learning / Jan 27, 2010 Revisiting Logistic Regression & Naïve Bayes Aarti Singh Machine Learning 10-701/15-781 Jan 27, 2010 Generative and Discriminative Classifiers Training classifiers involves learning a mapping f: X -> Y,

More information

Kernel Methods & Support Vector Machines

Kernel Methods & Support Vector Machines & Support Vector Machines & Support Vector Machines Arvind Visvanathan CSCE 970 Pattern Recognition 1 & Support Vector Machines Question? Draw a single line to separate two classes? 2 & Support Vector

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Object and Action Detection from a Single Example

Object and Action Detection from a Single Example Object and Action Detection from a Single Example Peyman Milanfar* EE Department University of California, Santa Cruz *Joint work with Hae Jong Seo AFOSR Program Review, June 4-5, 29 Take a look at this:

More information

Bagging for One-Class Learning

Bagging for One-Class Learning Bagging for One-Class Learning David Kamm December 13, 2008 1 Introduction Consider the following outlier detection problem: suppose you are given an unlabeled data set and make the assumptions that one

More information

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010 INFORMATICS SEMINAR SEPT. 27 & OCT. 4, 2010 Introduction to Semi-Supervised Learning Review 2 Overview Citation X. Zhu and A.B. Goldberg, Introduction to Semi- Supervised Learning, Morgan & Claypool Publishers,

More information

CS 229 Final Project Report Learning to Decode Cognitive States of Rat using Functional Magnetic Resonance Imaging Time Series

CS 229 Final Project Report Learning to Decode Cognitive States of Rat using Functional Magnetic Resonance Imaging Time Series CS 229 Final Project Report Learning to Decode Cognitive States of Rat using Functional Magnetic Resonance Imaging Time Series Jingyuan Chen //Department of Electrical Engineering, cjy2010@stanford.edu//

More information

BSB663 Image Processing Pinar Duygulu. Slides are adapted from Selim Aksoy

BSB663 Image Processing Pinar Duygulu. Slides are adapted from Selim Aksoy BSB663 Image Processing Pinar Duygulu Slides are adapted from Selim Aksoy Image matching Image matching is a fundamental aspect of many problems in computer vision. Object or scene recognition Solving

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 13 UNSUPERVISED LEARNING If you have access to labeled training data, you know what to do. This is the supervised setting, in which you have a teacher telling

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

CHAPTER 3 DIFFERENT DOMAINS OF WATERMARKING. domain. In spatial domain the watermark bits directly added to the pixels of the cover

CHAPTER 3 DIFFERENT DOMAINS OF WATERMARKING. domain. In spatial domain the watermark bits directly added to the pixels of the cover 38 CHAPTER 3 DIFFERENT DOMAINS OF WATERMARKING Digital image watermarking can be done in both spatial domain and transform domain. In spatial domain the watermark bits directly added to the pixels of the

More information

Data Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs)

Data Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs) Data Mining: Concepts and Techniques Chapter 9 Classification: Support Vector Machines 1 Support Vector Machines (SVMs) SVMs are a set of related supervised learning methods used for classification Based

More information

Support Vector Machines

Support Vector Machines Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining

More information

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)

More information

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used. 1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when

More information

Face Recognition using Eigenfaces SMAI Course Project

Face Recognition using Eigenfaces SMAI Course Project Face Recognition using Eigenfaces SMAI Course Project Satarupa Guha IIIT Hyderabad 201307566 satarupa.guha@research.iiit.ac.in Ayushi Dalmia IIIT Hyderabad 201307565 ayushi.dalmia@research.iiit.ac.in Abstract

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Visible and Long-Wave Infrared Image Fusion Schemes for Situational. Awareness

Visible and Long-Wave Infrared Image Fusion Schemes for Situational. Awareness Visible and Long-Wave Infrared Image Fusion Schemes for Situational Awareness Multi-Dimensional Digital Signal Processing Literature Survey Nathaniel Walker The University of Texas at Austin nathaniel.walker@baesystems.com

More information

Semi-supervised learning and active learning

Semi-supervised learning and active learning Semi-supervised learning and active learning Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Combining classifiers Ensemble learning: a machine learning paradigm where multiple learners

More information

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Ramin Zabih Computer Science Department Stanford University Stanford, California 94305 Abstract Bandwidth is a fundamental concept

More information

Leave-One-Out Support Vector Machines

Leave-One-Out Support Vector Machines Leave-One-Out Support Vector Machines Jason Weston Department of Computer Science Royal Holloway, University of London, Egham Hill, Egham, Surrey, TW20 OEX, UK. Abstract We present a new learning algorithm

More information

Color Local Texture Features Based Face Recognition

Color Local Texture Features Based Face Recognition Color Local Texture Features Based Face Recognition Priyanka V. Bankar Department of Electronics and Communication Engineering SKN Sinhgad College of Engineering, Korti, Pandharpur, Maharashtra, India

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Image Data: Classification via Neural Networks Instructor: Yizhou Sun yzsun@ccs.neu.edu November 19, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1 Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that

More information

Naïve Bayes for text classification

Naïve Bayes for text classification Road Map Basic concepts Decision tree induction Evaluation of classifiers Rule induction Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Support

More information

Chapter 10. Conclusion Discussion

Chapter 10. Conclusion Discussion Chapter 10 Conclusion 10.1 Discussion Question 1: Usually a dynamic system has delays and feedback. Can OMEGA handle systems with infinite delays, and with elastic delays? OMEGA handles those systems with

More information

CS 664 Segmentation. Daniel Huttenlocher

CS 664 Segmentation. Daniel Huttenlocher CS 664 Segmentation Daniel Huttenlocher Grouping Perceptual Organization Structural relationships between tokens Parallelism, symmetry, alignment Similarity of token properties Often strong psychophysical

More information

Chapter 7. Conclusions and Future Work

Chapter 7. Conclusions and Future Work Chapter 7 Conclusions and Future Work In this dissertation, we have presented a new way of analyzing a basic building block in computer graphics rendering algorithms the computational interaction between

More information

Scan Scheduling Specification and Analysis

Scan Scheduling Specification and Analysis Scan Scheduling Specification and Analysis Bruno Dutertre System Design Laboratory SRI International Menlo Park, CA 94025 May 24, 2000 This work was partially funded by DARPA/AFRL under BAE System subcontract

More information

Transductive Learning: Motivation, Model, Algorithms

Transductive Learning: Motivation, Model, Algorithms Transductive Learning: Motivation, Model, Algorithms Olivier Bousquet Centre de Mathématiques Appliquées Ecole Polytechnique, FRANCE olivier.bousquet@m4x.org University of New Mexico, January 2002 Goal

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

2. LITERATURE REVIEW

2. LITERATURE REVIEW 2. LITERATURE REVIEW CBIR has come long way before 1990 and very little papers have been published at that time, however the number of papers published since 1997 is increasing. There are many CBIR algorithms

More information

Geometric Computations for Simulation

Geometric Computations for Simulation 1 Geometric Computations for Simulation David E. Johnson I. INTRODUCTION A static virtual world would be boring and unlikely to draw in a user enough to create a sense of immersion. Simulation allows things

More information

Data mining with Support Vector Machine

Data mining with Support Vector Machine Data mining with Support Vector Machine Ms. Arti Patle IES, IPS Academy Indore (M.P.) artipatle@gmail.com Mr. Deepak Singh Chouhan IES, IPS Academy Indore (M.P.) deepak.schouhan@yahoo.com Abstract: Machine

More information

Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2015 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

More information

CHAPTER 5 PROPAGATION DELAY

CHAPTER 5 PROPAGATION DELAY 98 CHAPTER 5 PROPAGATION DELAY Underwater wireless sensor networks deployed of sensor nodes with sensing, forwarding and processing abilities that operate in underwater. In this environment brought challenges,

More information

Image Transformation Techniques Dr. Rajeev Srivastava Dept. of Computer Engineering, ITBHU, Varanasi

Image Transformation Techniques Dr. Rajeev Srivastava Dept. of Computer Engineering, ITBHU, Varanasi Image Transformation Techniques Dr. Rajeev Srivastava Dept. of Computer Engineering, ITBHU, Varanasi 1. Introduction The choice of a particular transform in a given application depends on the amount of

More information

Supervised Learning with Neural Networks. We now look at how an agent might learn to solve a general problem by seeing examples.

Supervised Learning with Neural Networks. We now look at how an agent might learn to solve a general problem by seeing examples. Supervised Learning with Neural Networks We now look at how an agent might learn to solve a general problem by seeing examples. Aims: to present an outline of supervised learning as part of AI; to introduce

More information

Topics in Machine Learning

Topics in Machine Learning Topics in Machine Learning Gilad Lerman School of Mathematics University of Minnesota Text/slides stolen from G. James, D. Witten, T. Hastie, R. Tibshirani and A. Ng Machine Learning - Motivation Arthur

More information

Defining a Better Vehicle Trajectory With GMM

Defining a Better Vehicle Trajectory With GMM Santa Clara University Department of Computer Engineering COEN 281 Data Mining Professor Ming- Hwa Wang, Ph.D Winter 2016 Defining a Better Vehicle Trajectory With GMM Christiane Gregory Abe Millan Contents

More information

DM6 Support Vector Machines

DM6 Support Vector Machines DM6 Support Vector Machines Outline Large margin linear classifier Linear separable Nonlinear separable Creating nonlinear classifiers: kernel trick Discussion on SVM Conclusion SVM: LARGE MARGIN LINEAR

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Classification and Clustering Classification and clustering are classical pattern recognition / machine learning problems

More information

A Neural Network for Real-Time Signal Processing

A Neural Network for Real-Time Signal Processing 248 MalkofT A Neural Network for Real-Time Signal Processing Donald B. Malkoff General Electric / Advanced Technology Laboratories Moorestown Corporate Center Building 145-2, Route 38 Moorestown, NJ 08057

More information

Image Mosaicing with Motion Segmentation from Video

Image Mosaicing with Motion Segmentation from Video Image Mosaicing with Motion Segmentation from Video Augusto Román and Taly Gilat EE392J Digital Video Processing Winter 2002 Introduction: Many digital cameras these days include the capability to record

More information

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 1. Introduction Reddit is one of the most popular online social news websites with millions

More information

CPSC 340: Machine Learning and Data Mining. Kernel Trick Fall 2017

CPSC 340: Machine Learning and Data Mining. Kernel Trick Fall 2017 CPSC 340: Machine Learning and Data Mining Kernel Trick Fall 2017 Admin Assignment 3: Due Friday. Midterm: Can view your exam during instructor office hours or after class this week. Digression: the other

More information

Formal Model. Figure 1: The target concept T is a subset of the concept S = [0, 1]. The search agent needs to search S for a point in T.

Formal Model. Figure 1: The target concept T is a subset of the concept S = [0, 1]. The search agent needs to search S for a point in T. Although this paper analyzes shaping with respect to its benefits on search problems, the reader should recognize that shaping is often intimately related to reinforcement learning. The objective in reinforcement

More information

Chapter 9 Object Tracking an Overview

Chapter 9 Object Tracking an Overview Chapter 9 Object Tracking an Overview The output of the background subtraction algorithm, described in the previous chapter, is a classification (segmentation) of pixels into foreground pixels (those belonging

More information

An Approach for Reduction of Rain Streaks from a Single Image

An Approach for Reduction of Rain Streaks from a Single Image An Approach for Reduction of Rain Streaks from a Single Image Vijayakumar Majjagi 1, Netravati U M 2 1 4 th Semester, M. Tech, Digital Electronics, Department of Electronics and Communication G M Institute

More information

Subspace Clustering with Global Dimension Minimization And Application to Motion Segmentation

Subspace Clustering with Global Dimension Minimization And Application to Motion Segmentation Subspace Clustering with Global Dimension Minimization And Application to Motion Segmentation Bryan Poling University of Minnesota Joint work with Gilad Lerman University of Minnesota The Problem of Subspace

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation.

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation. Equation to LaTeX Abhinav Rastogi, Sevy Harris {arastogi,sharris5}@stanford.edu I. Introduction Copying equations from a pdf file to a LaTeX document can be time consuming because there is no easy way

More information

FACE RECOGNITION USING SUPPORT VECTOR MACHINES

FACE RECOGNITION USING SUPPORT VECTOR MACHINES FACE RECOGNITION USING SUPPORT VECTOR MACHINES Ashwin Swaminathan ashwins@umd.edu ENEE633: Statistical and Neural Pattern Recognition Instructor : Prof. Rama Chellappa Project 2, Part (b) 1. INTRODUCTION

More information

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Neural Networks CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Biological and artificial neural networks Feed-forward neural networks Single layer

More information