SURVEILLANCE OF TIME-VARYING GEOMETRY OBJECTS USING A MULTI-CAMERA ACTIVE-VISION SYSTEM


SURVEILLANCE OF TIME-VARYING GEOMETRY OBJECTS USING A MULTI-CAMERA ACTIVE-VISION SYSTEM

Ph.D. Thesis
Candidate: Matthew Mackay
Supervisor: Professor B. Benhabib
Department of Mechanical and Industrial Engineering
June 2011

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy, Graduate Department of Mechanical and Industrial Engineering, University of Toronto

Copyright by Matthew Mackay, 2011

Surveillance of Time-Varying Geometry Objects Using a Multi-Camera Active-Vision System
Matthew Mackay
Ph.D. Thesis, 2011
Department of Mechanical and Industrial Engineering
University of Toronto

ABSTRACT

This thesis proposes a multi-camera active-vision system that dynamically selects camera poses on-line, in real time, for near-optimal time-varying-geometry (TVG) action-sensing performance. Active vision for TVG objects requires an on-line sensor-planning strategy that incorporates information about the object and the state of the environment, including obstacles, into the pose-selection process. Thus, this research is designed specifically for real-time sensing-system reconfiguration for the recognition of a single TVG object and its actions in a cluttered, dynamic environment, which may contain multiple other dynamic (maneuvering) obstacles. The proposed methodology was developed as a complete, customizable sensing-system framework, which can be readily modified to suit a variety of specific TVG action-sensing tasks: a 10-stage, real-time pipeline architecture. This pipeline consists of Sensor Agents, a Synchronization Agent, Point Tracking and De-Projection Agents, a Solver Agent, a Form-Recovery Agent, an Action-Recognition Agent, a Prediction Agent, a Central Planning Agent, and a Referee Agent.

In order to validate the proposed methodology, rigorous experiments are also presented herein. They confirm the basic assumptions of active vision for TVG objects, and characterize gains in sensing-task performance. Simulated experiments provide a method for rapid evaluation of new sensing tasks. These experiments demonstrate a tangible increase in single-action recognition performance over the use of a static-camera sensing system. Furthermore, they illustrate the need for feedback in the pose-selection process, allowing the system to incorporate knowledge of the OoI's form and action. Later real-world, multi-action and multi-level action experiments demonstrate the same tangible increase when sensing real-world objects that perform multiple actions, which may occur simultaneously or at differing levels of detail. A final set of real-world experiments characterizes the real-time performance of the proposed methodology in relation to several important system design parameters, such as the number of obstacles in the environment and the size of the action library. Overall, it is concluded that the proposed system tangibly increases TVG action-sensing performance, and can be generalized to a wide range of applications, including human-action sensing. Future research is proposed to develop similar methods to address deformable objects and multiple objects of interest.

To my parents, Donald and Kathy Mackay: your constant encouragement and guidance have helped me to become the person I am today. It is because of you that I can pursue a field in which I truly enjoy working. Thank you for everything.

To my fiancée, Jing Shen: you have made every day since I met you infinitely more enjoyable. Without your support, I could never have finished this thesis, and without you, I would have no one to share my success with. Thank you for everything.

ACKNOWLEDGEMENTS

Firstly, I would like to thank my thesis supervisor, Professor Beno Benhabib, for his continued support throughout my thesis program. Through his helpful advice and guidance, I have been able to see this program through to its completion. The valuable input of my professors and thesis committee, Prof. R. Ben Mrad, Prof. F. Ben Amara, Prof. J. K. Mills, and Prof. G. Nejat, was also instrumental in developing this dissertation, and is greatly appreciated.

Also, I would like to express my gratitude to those who developed the experimental platform and test-bed before me, including Ardevan Bakhtari and Michael D. Naish. I also thank Hans de Ruiter for his valuable input in designing the human analogue, Jeffery Nagashima for his subsequent development, and Thuwaragan Sritharan for his final implementation and thesis project. Similarly, I would like to thank Cecilia Chen Liu for her helpful thesis project work on intelligent parameter selection.

My friends and colleagues in the department and the CIMLab, including Hans de Ruiter, Faraz Kunwar, Usman Ghuman, Ashish Macwan, Yufeng Ding, Patricia Sheridan, Cecilia Chen Liu, Christopher Wong, Christopher Hawryluck, Gelareh Namdar, Pawel Kosicki, David Schacter, Hay Azulay, Ji Ke, Anson Wong, Stephanie Deng, and Sean Bittle, have enriched my studies at the University of Toronto, and provided generous help and support throughout my program. I must also thank my friends and colleagues outside the department, especially Sinisa Colic, Ian Pang, and Bogdan Simion, for their advice and support.

Lastly, I gratefully acknowledge the financial support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and the University of Toronto Graduate Fellowship program. Without their gracious funding, I would not have been able to complete my degree.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
NOMENCLATURE AND ACRONYMS

1. Introduction
   Motivation
   Research Objective
   Action Recognition Literature Review
   Surveillance
   Object Identification and Tracking
   Object Recognition
   Action Recognition
   Sensing-System Reconfiguration
   Off-line, Static Environment, Fixed-Geometry Reconfiguration
   On-line, Static Environment, Fixed-Geometry Reconfiguration
   Dynamic Environment, Fixed-Geometry Reconfiguration
   Multi-Target, Dynamic Environment, Fixed-Geometry Reconfiguration
   Static Environment, Single TVG Target Reconfiguration
   Dynamic Environment, Single TVG Target Reconfiguration
   Applications Human Action Sensing
   Research Tasks
   Summary of Contributions
   Dissertation Overview

2. Problem Definition
   Overview
   TVG Object- and Subject-Action Representation
   Action-Recognition Task and Quantization
   TVG Action-Sensing Tasks
   Next-Best-View Problem and Sensing-System Reconfiguration Tasks
   General and TVG Object-Specific Issues
   Next-Best-View Problem
   Sensing System Tasks
   Base Problem Model
   Performance Optimization
   Performance as a Function of Visibility
   Final Optimization Problem
   Constraints
   Physical Constraints
   Time Constraints
   Other Constraints
   Complete Optimization
   Visibility Metric
   Past Visibility Metrics
   Visibility Metric Properties
   Metric Distance
   Metric Angle
   Metric Visible Area
   Model-Based, Multi-Sub-Part Metric
   Summary

3. Customizable TVG Action Sensing Framework
   3.1 Pipeline Background
   Pipeline Architecture Overview
   Update Structure
   Pipeline Depth and Superscalar Execution
   Pipeline Stages
   Stage L1 Imaging Agent
   Stage L2 Synchronization Agent
   Stage L3 Point Tracking Agents
   Stage L4 De-projection Agents
   Stage L5 3D Solver Agent
   Stage L6 Form Recovery Agent
   Stage L7 Prediction Agent
   Stage L8 Central Planning Agent
   Stage L9 Action Recognition Agent
   Stage L10 Referee Agent
   Summary

4. Single-Action Sensing for Human Subjects
   Initial Single-Action Sensing Methodology
   Central Planner Architecture
   CPA, Optimization, and the Visibility Metric
   Simulation Environment
   Object Modeling
   Sensor Modeling
   Target Model
   Simulated Experiments
   Experimental Set-up
   Simulation Results
   4.4 Summary

5. Multi-Action Sensing and Multi-Level Action Recognition
   Multi-Action and Multi-Level Recognition Methodology
   Sensor Agents and Central Planning Agent
   Form Recovery and Action Recognition Agents
   Pose Prediction and Referee Agents
   Human Analogue
   System Calibration
   Base Calibration Methodology
   System Model
   Calibration Process
   Real-world Experiments
   Experimental Setup
   Basic Real-world Experimental Results
   Multi-Action Experimental Results
   Multi-Level Action Experimental Results
   Summary

6. Real-time Human Action Recognition
   Real-time Methodology
   Real-time Constraints
   Formal System Evaluation
   System Comparison
   Evaluation Environment and TVG Analogue Design
   Real-world, Real-time Experiments
   Experimental Implementation
   Experimental Environment and Set-up
   Real-Time Multi-Action Sensing and Comparison to Past Results
   6.3.4 Real Human Action Sensing Experiments
   Real-Time Performance Characterization
   Summary

7. Conclusions and Future Work
   Summary and Conclusions
   Customizable Framework
   Single-Action Sensing
   Multi-Action and Multi-Level Action Sensing
   Real-time Sensing
   Future Work
   Hand Gesture Recognition and Improved Multi-Level Recognition
   Facial Expression Recognition and Other Deformable Object Actions
   True Multi-Subject Action Sensing

References

Appendix A  Common Pre-Processing Filters and Implementations
Appendix B  Common Interest Filters and Implementations
Appendix C  Common Image Compression Algorithms
Appendix D  Overview of Feature Point Detection Methods
Appendix E  Overview of Feature Point Tracking Methods

LIST OF TABLES

Table 1.1  Summary of Contributions
Table 2.1  Code Listing for Action Classification
Table 2.2  Code Listing for Per-Instant Visibility Optimization
Table 2.3  Code Listing for Completed Core Optimization Task
Table 5.1  Pseudo-Code Listing for Distributed Form Recovery Process
Table 5.2  Pseudo-Code Listing for Proposed Novel System Calibration Method
Table 5.3  Nomenclature for Calibration Method Pseudo-Code Listing
Table 5.4  Results for Human Gait-Based Identification Multi-Action Trial
Table 5.5  Results for Random, Simultaneous Action Trial
Table 6.1  Results for Experiment 2, Real-time Surveillance of Real Human
Table 6.2  System Metrics for Increasing dof in Experiment

LIST OF FIGURES

Figure 2.1  Graphical View of the Surface Equation for a Volume
Figure 2.2  Overview of Differences in Articulated and Deformable Objects
Figure 2.3  Comparison of Object Models for a Human Arm
Figure 2.4  State-Machine Representation of Object Actions
Figure 3.1  Overview of Proposed Customizable Pipeline Architecture
Figure 3.2  Stage L1 Internal Pipeline and Sub-Blocks
Figure 3.3  Projection of Feature Point Into Pixel Coordinates
Figure 3.4  Iterative Form Recovery Process Flowchart
Figure 3.5  Action Recognition State Machine
Figure 4.1  Overview of Proposed Single-Action, CPA-Based Architecture
Figure 4.2  Clipped Projection Plane for Visibility Metric Calculation
Figure 4.3  Overview of Sensor Assignment for Initial Methodology
Figure 4.4  Un-occluded Area Calculation for Initial Visibility Metric
Figure 4.5  Sample Simulated Object Models
Figure 4.6  (Left) Skeletal Model (Middle) Human Figure (Right) Simulated Figure
Figure 4.7  Overview of Simulated Active-Vision Environment
Figure 4.8  Obstacle Locations and OoI Path for Simulated Experiment
Figure 4.9  Obstacle Locations and Paths for Simulated Experiment
Figure 4.10  Comparison of Error Metric over 100 Simulated Demand Instants
Figure 4.11  Camera Views for Demand Instant 85, Simulated Experiment
Figure 4.12  Visibility Metric Calculation for Demand Instant 85, Experiment
Figure 4.13  Error Metric Plot for Ideal Prediction, Static Obstacles
Figure 4.14  Error Metric Plot for Ideal Prediction, Dynamic Obstacles
Figure 4.15  Error Metric Plot for Real Prediction, Static Obstacles
Figure 4.16  OoI Pose Estimation for Static Obstacle, Real Prediction Trial
Figure 4.17  Error Metric Plot for Tracking Failure Trial
Figure 4.18  OoI Pose Estimation for Tracking Failure Trial
Figure 4.19  Error Metric Plot for Real Prediction, Dynamic Obstacles
Figure 4.20  OoI and Obstacle Pose Estimation for Dynamic Obstacle Trial
Figure 4.21  Comparison of Error Metric Plots for Experiment
Figure 5.1  Overview of Multi-Action and Multi-Level Action Methodology

Figure 5.2  Overview of Human Analogue for Initial Real-world Experiments
Figure 5.3  Coordinate Systems and Additive Error
Figure 5.4  Example Camera Motion Stage Mis-Alignment
Figure 5.5  System Calibration Results for Sample Calibration
Figure 5.6  Overview of Real-world Active-Vision Environment
Figure 5.7  Obstacle Locations and OoI Path for Simulated Experiment
Figure 5.8  Simulated and Experimental Results for No Obstacles Baseline Trial
Figure 5.9  Experimental Results for Static Obstacles Baseline Trial
Figure 5.10  Movie-Strip View of Sample Demand Instant
Figure 5.11  Experimental Results for Dynamic Obstacles Baseline Trial
Figure 5.12  Sample Pointing Motion for Trial
Figure 5.13  Scatterplot of Results for Secondary Parameter Estimation
Figure 6.1  Abstract Overview of Pipeline Architecture
Figure 6.2  Human Analogue Designed for Future Real-world Experiments
Figure 6.3  Human Analogue Verification Experiment Overview
Figure 6.4  Comparison of Error Metric for Human Analogue and Real Human
Figure 6.5  Sensor Pose Comparison for Human Analogue and Real Human
Figure 6.6  Top-down View of Real-Time Experimental Environment and Setup
Figure 6.7  Obstacle Location Overview for Experiment
Figure 6.8  Modified Sensor Layout for Real-Human Trials
Figure 6.9  Results for Experiment 1 Testing Real-Time System
Figure 6.10  Effect of Number of dof on Minimum Update Interval
Figure 6.11  Minimum Update Interval and the Effect of Obstacles
Figure 6.12  Minimum Update Interval and Library Size Effect
Figure A.1  Effect of Gaussian LPF at Various Strengths
Figure A.2  Example Applications for Notch Filter Reference Implementations
Figure A.3  Brightness and Contrast Normalization Reference Method Example

14 NOMENCLATURE AND ACRONYMS Latin Symbols ,.. Major axis of elliptical tracking area Set of connections in connection graphs Action feature vector (continuous-time) Action feature vector (discrete-time) An action feature vector with multiple simultaneous actions (combined action) Set of all achievable poses for the system The action feature vector in a numbered sequence The complete action feature vector from a given library of actions Maximum total acceleration for sensor Maximum possible surface area of the OoI visible from a given view Acceleration vector for the sensor in a system Minor axis of elliptical tracking area Average pixel brightness Blue color for pixel (, ) Pixel brightness range Proportionality constants, uncertainty estimate, term Closest point to which satisfies constraint, object coordinates Center of an object in environment, world coordinates Center of a sensor, world coordinates Center of camera image, pixel coordinates, -direction Center of camera image, pixel coordinates, -direction Clamping function [Equation (3.10)] Average frame-to-frame pixel displacement Upper feasible limit travel distance Lower feasible limit travel distance Uncertainty in estimated form Sum of squared differences for form feature vector Metric of constraint quality xiv

15 (..).... (..) Maximum possible distance from camera focus to OoI Generic error metric value Error in constraint, GM estimator Average pose-set evaluation rate Camera focal point, world coordinates Gaussian Random Variable, representing average time for a pose decision Function to determine achievable camera poses Frequency of completed pose decisions Decision criteria function, relating observed subject action to library actions Object surface function, representing depth from center to arbitrary point Frequency of incomplete decisions Closed-form function relating visibility for sensor to sensor pose Inverse stage transform function Effective average system update rate Maximum average update rate Minimum average update rate..,,.., Second lower limit on system update rate Camera focal length in horizontal pixels Camera focal length in vertical pixels [Equation (2.12)] Combination function for visibility metrics Gaussian kernel component Gaussian function, single dimension Gaussian function, two dimensions Set of functions relating visibility to performance Green color for pixel (, ) Arbitrary, monotonically-increasing function relating performance to visibility Extrinsic model function order Minimum interest level Pixel interest for pixel (, ) Raw image matrix Iteration counter of current loop, Maximum extrinsic estimation iterations xv

16 , Maximum intrinsic estimation iterations,,,,,.... Maximum overall estimation iterations Extrinsic model function order Total number of interest filters [Table 5.2] Extrinsic model function order Intrinsic camera calibration matrix Vector of lens distortion coefficients The lens distortion coefficient Horizontal Sobel kernel Vertical Sobel kernel Seperated Gaussian kernel, component Seperated Gaussian kernel, component Predicted feature point location for feature, predictive filter Translation coefficient function for -element, ) Translation coefficient function for -element, ) Translation coefficient function for -element, ) Lower pan limit of camera Upper pan limit of camera Lower tilt limit of camera Upper tilt limit of camera Length of projected OoI un-occluded segment Lower limit of motion axis Upper limit of motion axis Library feature vector element (, ) The level mip-map image Number of pixels, -direction Estimate of pixel motion, component Estimate of pixel motion, component, as function of location Estimate of pixel motion, component Estimate of pixel motion, component, as function of location Predicted feature point location for feature, model prediction xvi

17 .. Number of mip-map levels [Equation (2.1)] Object surface depth vector Number of assigned points Number of actions in an arbitrary action library Number of distinct area of the OoI surface that are visible in a 2-D projection Number of cameras in a system Number of model dof Number of feature points Number of pose sets to fuse Number of joints in an articulated object model Number of frames Number of points in model Number of subjects Number of missing points Number of actions Number of obstacles in an environment Number of pixels, -direction Number of points to recover Number of discretized positions Number of sensors in sensing system Number of constraints Number of demand instants Number of unassigned points Number of sub-parts in visibility metric calculation Number of sensors in sensing system Combined objective function for visibility optimization Portion of a task that can be performed in parallel [Equation (2.7)] Vector of sensor parameters [Equation (6.40)] Input image [Equation (6.40)] Output image Number of parallel processing elements Vector of pixel coordinates xvii

18 ,,,,,,,,,,, Lower achievable limit on sensor travel, sensor Upper achievable limit on sensor travel, sensor System sensor pose selected when sensing a human analogue System sensor pose selected when sensing a real human Set of all feasible poses for system Pixel color vector for pixel (, ) Raw pixel value for pixel (, ), color channel Interest-filtered pixel value for pixel (, ), color channel Interest-filtered pixel value, pixel (, ), color channel, sensor, instant Synthetic pixel value, pixel (, ), color channel, sensor The potential pose set, Referee Agent Lower physical limit on sensor travel, sensor Pose of obstacle at the demand instant Pose of the OoI at the demand instant World coordinate, 6-dof pose of the sensor at demand instant.. Upper physical limit on sensor travel Sensing-task performance Sensing-task performance. Performance when static cameras are used Performance when using active-vision system Performance of the system for comparison Fused predicted location for feature Absolute difference in performance Maximum quantized pixel value Minimum quantized pixel value Sum of squared normalized camera coordinates Generalized rotation and scaling matrix [Table 5.2] Extrinsic model rotation function order Decision ratio The, ) component of a generalized rotation matrix Stage rotation matrix xviii

19 , Red color for pixel (, ) Average connection length Circular tracking region radius Constant rotation offset for φ The rotation function coefficient for Constant rotation offset for ψ The rotation function coefficient for Constant rotation offset for θ The rotation function coefficient for Un-rotated reference point location from model library, point Speedup Pixel displacement standard deviation Sample standard deviation of time to complete one pose decision Subscript, referring to the sensor Arbitrary time Stage translation vector World time for current demand instant World time for next demand instant Lower tangency point for OoI Timestamp for sensor, arbitrary instant (early) Upper tangency point for OoI Timestamp for sensor, arbitrary instant (late) Local image timestamp Time normalization constant Time differential for local hardware to synchronization block Time difference between adjacent Demand Instants Library time for end of action Measured time for end of action Arbitrary library action end time Rolling horizon length in Demand Instants The constant axis offset in camera coordinates [Table 2.3] Demand instant world time for instant Time penalty representing system overhead xix

20 Arbitrary library action start time Time to reach maximum travel limit, first direction Time to reach maximum travel limit, second direction Minimum update interval Latest observed time for sensor Time spent processing a system pose decision Difference in time between next two instants Time of request Synchronized world timestamp Library time for start of action Measured start time of action World time Time spent waitin Outlier threshold for pixel coordinates The component of a generalized translation Recovered subject form feature-vector for instant Action feature vector, action The ordinal ranking of multiple systems for comparison Generic velocity Set of vertices in connection graph Current sensor velocity (one dof only) Total, fused visibility in Referee Agent, potential pose set Combined visibility metric for sensor Visibility of the sub-part of a multi-part OoI, sensor, demand instant Maximum speed for sensor Minimum camera visibility limit Velocity vector for the sensor in a system Visibility metric of the sensor in a system Surface area of the visible projected part of the OoI Variance in joint angle 1 Variance in joint angle 2 Angle sub-metric value xx

21 .... Visible area sub-metric value Distance sub-metric value The sub-metric of visibility Visibility metric area weight Visibility metric angle weight Visibility metric distance weight The weight for the corresponding visibility metric sub-part Weight of the object sub-part in multi-part visibility calculation Over-scaling weight factor Constraint weighting factor, constraint Weight for prediction of feature point location [Equation (2.4)] Continuous-time 3-D joint position function for object model joint, used for construction action feature vectors [Equation (2.6)] Discrete-time 3-D joint position function for object model joint, used for construction action feature vectors [Equation (2.1)] Arbitrary object surface point, X-coordinate General position vector General velocity vector General acceleration vector Initial 1-dof position The assigned featore vector dimension Lower final achievable limit Upper final achievable limit Normalized camera coordinate, -direction [Equation (2.1)] Object center-of-mass, X-coordinate Vector of constant stage offsets Constant -translation offset Parametric constraint equation, vector Vector of distorted camera coordinates Final detected feature point location Lower feasible limit Upper feasible limit xxi

22 ,, Gaussian noise vector Detected feature point location, world coordinates Detected feature point location, rotated object coordinates Detected feature point location, object coordinates Feature point vector, repetition, subject, keyframe, feature, action Kalman filter state space Lower achievable limit (un-cropped) The library dimension Lower achievable limit, vector Upper achievable limit, vector Achievable lower limit, single dof Feasible lower limit, single dof Measured target position Nearest model point to feature point location Vector of normalized camera coordinates Normalized camera coordinate, -component Vector of observed pixel coordinates for m th point in dataset Estimated OoI form feature vector element Vector of pixel coordinates Observed pixel coordinates for a given OoI point, -coordinate Principal point 1, pixel coordinate, elliptical tracking area Principal point 2, pixel coordinate, elliptical tracking area Vector of estimated pixel coordinates from projecting Elliptical area center, pixel coordinate Predicted tracked location Vector of intermediate, de-rotated coordinates Pose vector for camera at m th point Known world x-coordinates of the camera movement stage for the m th dataset point Known world -coordinates of the m th OoI point Vector of known world coordinates of the m th OoI point Closest point to solution which satisfies 3-D solver constraint, x-component xxii

23 True location of target point Upper achievable limit (un-cropped) Achievable upper limit, single dof Feasible upper limit, single dof Vector of world coordinates World coordinates, X-coordinate Paramtetric world coordinates, -coordinate Predicted point location, sensor, world coordinates.. Detected point location, sensor, world coordinates Set of four reference points in recovered model, world coordinates.. Set of four reference points in recovered model, rotated object coordinates [Equation (2.1)] Arbitrary object surface point, Y-coordinate Normalized camera coordinate, -direction [Equation (2.1)] Object center-of-mass, Y-coordinate Constant -translation offset Normalized camera coordinate, -component Observed pixel coordinates for a given OoI point, -coordinate Elliptical area center, pixel coordinate Principal point 1, pixel coordinate, elliptical tracking area Principal point 2, pixel coordinate, elliptical tracking area Known world y-coordinates of the camera movement stage for the m th dataset point Known world -coordinates of the m th OoI point Closest point to solution which satisfies 3-D solver constraint, y-component World coordinates, Y-coordinate Paramtetric world coordinates, -coordinate [Equation (2.1)] Arbitrary object surface point, Z-coordinate Normalized camera coordinate, -direction [Equation (2.1)] Object center-of-mass, Z-coordinate Constant -translation offset Known world z-coordinates of the camera movement stage for the m th dataset point Known world -coordinates of the m th OoI point Closest point to solution which satisfies 3-D solver constraint, z-component xxiii

Parametric world coordinates, -coordinate World coordinates, Z-coordinate

Greek Symbols

Δ Angle between axes in pixel coordinates [Table 2.2] Distance measure between current OoI action and library action Uncertainty in predictive filter prediction Uncertainty in model prediction [Table 2.2] Maximum distance for positive action match Upper limit on difference between selected poses Upper limit on average difference between selected poses World rotation (first component angle) World rotation (second component angle) World rotation (third component angle) Angle from camera center line to OoI center Maximum possible angle from camera to OoI center Known rotation of the camera movement stage for the m th point Known rotation of the camera movement stage for the m th point Known rotation of the camera movement stage for the m th point Delta change in value given by ( ) from last innermost loop iteration Minimum change in intrinsic parameters Joint angle 1 for model constraints Joint angle 2 for model constraints Minimum angular variance for joint dof 1 to be included Minimum angular variance for joint dof 2 to be included Standard deviation, constant, GM estimator

Acronyms

CoM  Center of Mass
CT   Continuous Time
DNC  Do-Not-Care (state)
dof  Degree(s) of Freedom

DT   Discrete Time
FTM  Flexible Tolerance Method
HPF  High-pass Filter
KF   Kalman Filter
LPF  Low-pass Filter
NBV  Next-Best-View (as in Next-Best-View Problem)
OoI  Object of Interest (also labeled OI in some works)
RoI  Region of Interest (also labeled RI in some works)
SNR  Signal-to-Noise Ratio
TVG  Time-Varying Geometry

1. Introduction

Understanding and interacting with the world is a task that is difficult to define mathematically or algorithmically. People tend to define it intuitively, in terms of known human actions and thought processes, and they also seek to emulate these processes through automated systems. It is the development of such systems that has been the subject of intense academic research over the past half-century. Even today, it is not known if a robot or an automated system will ever be capable of all tasks that a human is, let alone tasks which have not yet been imagined. However, the development of systems which can mimic basic human sensory capabilities is critical to the creation of more complex systems in the future.

1.1 Motivation

Understanding is a natural human process, which combines visualization and cognition. Humans combine many disparate data sources in understanding even the simplest of scenes. In particular, the human brain combines context, such as knowledge of underlying processes, visual cues, and other data in a non-deterministic manner when determining the content of a scene [1]. Vision, and by extension, understanding of vision, is crucial to many human actions. It provides feedback and a tool for learning and interacting with the environment [2]. Thus, it is also a core task for any automated system seeking to emulate human abilities.

The field of Computer Vision provides some of the first works which recognized this fact. Work began by defining a limited problem, termed image understanding [3]. The goal of image understanding is to, in an automated manner, examine and interpret the totality of the visual data available in a single, given image. By extension, an automated system should then be able to use this data to make decisions or to reason about the underlying scene, geometry, objects, etc. It was soon realized that this definition is limited, as it does not reflect the human visual cognition process. Humans do not view static images; they examine the complete scene, including its changes over time. They also interact with the environment to enhance their understanding. As such, scene understanding was defined [4], [5], wherein the goal of an automated system is to completely understand a scene, including its changes over time and the underlying processes behind these changes. To do so, the system may interact with the environment to uncover more data. This is where the goal of this work lies: to interact with the environment in an automated manner.

Generalized scene understanding encompasses a significant breadth of related problems. The tendency of the field is to pose each task as a semi-distinct problem, due to the inherent difficulty in creating a monolithic system like the human brain. The recognition of objects is one application area which has seen significant research in the past two decades. Automated systems have been developed which are capable of recognizing one or more objects in a given environment [6]. This task is of particular interest, since it is an atomic task that can be used to construct more complex functionality. To interact with an object, to pick it up with a robot, track its location, or to simply examine it and draw some conclusion, any automated system must first be able to recognize this object from others in the environment [7].

Not all objects are identical and, thus, there is further sub-division of this problem. Of particular interest are objects which change in shape, form, or general appearance over time. Such objects are herein termed time-varying-geometry (TVG) objects, owing to the time-varying nature of their form or structure. These objects are of interest since they comprise a significant proportion of the population of all objects. One of the most ubiquitous examples is a human; a human may change in appearance over the short and long term, through many processes, including human actions. Furthermore, these objects present a significant challenge to an automated sensing system. As will be examined in Section (1.3), sensing TVG objects presents issues unique to these objects.

TVG objects are also unique in that they may exhibit distinct, repeatable sequences of appearance or form, which are termed actions. It is highly desirable to recognize these actions, as they provide significant information that can be used in understanding a scene. Action recognition has been applied to a variety of tasks. For example, human action and gait recognition can be used as a biometric to uniquely identify human subjects [8]. Human actions, such as facial expressions and hand gestures, can convey significant information in a given human-human or human-machine interaction. As such, automated systems have been proposed to recognize and interpret such actions (e.g., [9], [10]). Action recognition can also provide visual feedback for robot or process control (e.g., [11]), or a method for teaching certain tasks to an automated system (e.g., [12]). Other example applications include surveillance and security (e.g., monitoring an airport for hostile actions, or watching a casino for cheating players). Safety-centric applications, such as monitoring for unsafe interactions between machinery and humans in a factory setting, may also be possible. TVG action recognition can also be applied in surgical and medical applications, such as patient stabilization, robot-assisted surgery, and even rehabilitation. These applications represent only a small portion of the totality of the field.

1.2 Research Objective

Given that one might wish to recognize a TVG object and its actions, the problem at hand would seem to be relatively simple to pose. However, it is the breadth of the types of TVG objects that makes this task difficult. While all TVG objects share common characteristics, there may be significant variation in the appearance and actions that constitute the object. Past work first focused on identifying characteristics that are common to all TVG objects. One of the earliest findings is that, as with fixed-geometry objects, not all views of a TVG object are of equal importance when performing object or action recognition. In essence, viewpoints are differentiated in their importance to the sensing process [13]. For fixed-geometry objects, there are many factors which have been identified that directly affect which views are most useful. When recognizing a fixed-geometry object, some un-occluded views simply provide more useful information about the object (e.g., a combination of frontal and side facial views is preferred by most face recognition algorithms, [14]). Occlusions also differentiate views; highly occluded views provide little information about the object, and are thus of little use to the sensing process [15]. Likewise, these and other general-case factors affect TVG objects. However, their effect on the sensing process tends to be much greater when compared to fixed-geometry objects.

As a continuation of this effect, TVG objects also exhibit viewpoint importance differentiation over time. As mentioned above, TVG objects are unique in that they may perform actions, changing their appearance or form over time. The same relative view of a TVG object, taken at two different instants in time, may not necessarily contain the same information [16]. Thus, viewpoints are further differentiated in their importance, as the object must be continuously sensed to recognize TVG object actions. This also has the side effect of further increasing basic view differentiation, as some views convey more useful information than others for distinguishing two actions performed by the same TVG object. TVG objects may also self-occlude during an action [17].

Given these issues, the complete task at hand can be broadly defined as the recognition of TVG objects and their actions, in cluttered, real-world environments, containing multiple other static and/or dynamic, fixed-geometry and/or TVG (view-obstructing) objects. The development of a generic methodology and system design to carry out this task efficiently and effectively constitutes the novel contribution of this work. In Section (1.3), a detailed review of past and current methods for TVG object and action recognition will be presented. As will be shown, the vast majority of these methods assume that the input data, the sensor data or camera images, is fixed. These methods focus on robustness to the above factors, but make no attempt to mitigate their presence before recognition begins. The true goal of this research is to develop a formal method to do just that: to focus on improving the input data first, before effort is expended on extracting useful data from the input. The novel method proposed to accomplish this task, which is a form of sensing-system reconfiguration, will be discussed later in Section (1.4).

1.3 Action Recognition Literature Review

Before developing the novel methodology that is the core of this work, it is prudent to examine past work in automated recognition of TVG objects and actions. This section will specifically focus on the limitations of these methods in relation to the input data. In doing so, it will be demonstrated that there is significant potential to improve the performance of these recognition efforts by expending a reasonable amount of effort on improving the input data first. The information presented herein will also be used in Chapter 2 to develop the core sensing problem into a formal, mathematical form.

1.3.1 Surveillance

The core concept of this research field is surveillance: the collection and analysis of sensor data to estimate the object parameters of pose (position and orientation) and form, in order to uniquely identify the object and categorize its current action. As part of this concept, the recognition of TVG objects includes both the identification of static forms and the recognition of sequences of varying forms. In current research, the vast majority of such time-varying-geometry objects are humans, although a few papers have focused on simple artificial shapes (e.g., [18]), robots and automated machines (e.g., [19]), and non-human living subjects (e.g., [20]).

However, the logical starting point for research in this field was the identification of a single static form, regardless of the choice of subject. This directly extends the fixed-geometry object recognition problem, as a single, static form could be modeled as a distinct object under a fixed-geometry methodology. Naturally, this would require an existing database of characteristic data and object forms (i.e., an action library). Thus, past work focused on merely reconstructing the model of the unknown object in the absence of any a priori information (e.g., [21]). In [22], an active-vision (moving-camera) system was used to explore a static scene automatically, in order to reconstruct the shape of a geometric primitive. While successful in the task at hand, this method does not make use of any contextual information about the object, and assumes that it is stationary (essentially, it is still a fixed-geometry object). Other similar methods tend to be application-specific, as a common application of such methods is 3-D modeling of objects for use in other environments, such as computer graphics (e.g., [23]).

However, these methods introduce a concept that is of critical importance to TVG action sensing. It was mentioned in Section (1.2) that viewpoints are differentiated in importance over time, since actions require continuous sensing. Given the current state of an active (movable) sensing system, the current form of the object of interest (OoI), and contextual information (i.e., the OoI action), it is possible to determine a future system state which maximizes the amount of unknown, useful information about the OoI that is uncovered. This is termed the Next-Best-View (NBV) problem, and it is a core concept for the active-vision systems that follow in Section (1.4). Optimizing the amount of unknown information about an object uncovered by each subsequent reconfiguration of an active-vision system is the basic method by which many sensing-system reconfiguration methods improve input data [24].
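
To make the NBV concept concrete, the short Python sketch below scores a discretized set of candidate camera poses by counting the previously unobserved OoI surface patches each pose would reveal, and greedily selects the highest-scoring candidate. This is an illustrative example only, not the planning method developed in this thesis; the candidate-pose set, the surface discretization, and the is_visible occlusion test are assumed to be supplied by the caller.

    import numpy as np

    def new_coverage(pose, surface_points, seen_mask, is_visible):
        # Count surface patches that this pose would reveal and that have not
        # yet been observed from any previously selected viewpoint.
        return sum(1 for i, p in enumerate(surface_points)
                   if not seen_mask[i] and is_visible(pose, p))

    def next_best_view(candidate_poses, surface_points, seen_mask, is_visible):
        # Greedy NBV selection: choose the candidate pose that uncovers the
        # largest amount of currently unknown information about the OoI.
        scores = [new_coverage(c, surface_points, seen_mask, is_visible)
                  for c in candidate_poses]
        best = int(np.argmax(scores))
        return candidate_poses[best], scores[best]

For a TVG object, such a selection would have to be repeated at every demand instant, since the surface that is useful to observe changes as the object performs an action.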

These works also mark a point of divergence between sensing-system reconfiguration and the computer vision methods that follow. Subsequent works began to assume that all sensors in the sensing system are static and, by extension, that the input data is fixed. The next logical extension to the sensing problem at hand is object identification, as the above methods assumed that the environment was un-cluttered and contained only the object of interest.

1.3.2 Object Identification and Tracking

Real-world, cluttered OoI identification is a non-trivial problem, as the system may not know a priori which portions of the incoming sensor data are associated with the OoI. In essence, the problem at hand is a segmentation problem, wherein the image must be segmented into areas that are of interest (i.e., belong to the OoI) and are not of interest (i.e., background clutter, other objects, and obstacles). Given this information, the area of interest representing the object can be tracked over time.

Early methods of OoI identification often depended on a priori knowledge of the object, or on specialized objects. For example, early methods used color-based segmentation (e.g., [25]), or feature-point markers (e.g., [26]), to detect the OoI in a sequence of 2-D camera images. Such methods inherently depend on a specific, customized object, and are thus not applicable to all TVG objects that one might sense with an automated system. Modern methods, such as [27], implement more complex schemes to identify an object based on generalized descriptive information, in a manner similar to how a human might describe a class of object. In [28], a set of customizable interest filters is used to construct interest maps based on generalized characteristics of the OoI. In general, OoI identification has become a distinct area of research, with multiple generalized and application-specific methods available. Any fixed-camera or active-vision system will typically incorporate a method of object identification, but the methods themselves typically assume that only a single, static image is available.
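
A deliberately simplified example of this kind of interest-based identification is sketched below: a colour-based interest map is thresholded and the largest connected region is returned as the OoI candidate in a single image. The HSV colour range is a hypothetical placeholder, and OpenCV is assumed only for illustration; the customizable interest filters cited above are considerably more general.

    import cv2
    import numpy as np

    # Hypothetical HSV range describing the OoI's dominant colour (placeholder values).
    LOWER_HSV = np.array([100, 80, 50])
    UPPER_HSV = np.array([130, 255, 255])

    def locate_ooi(frame_bgr):
        # Build a binary interest map from a colour cue, clean it up, and return
        # the bounding box of the largest connected region as the OoI candidate.
        hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, LOWER_HSV, UPPER_HSV)
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            return None
        largest = max(contours, key=cv2.contourArea)
        return cv2.boundingRect(largest)  # (x, y, w, h) in pixel coordinates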

After the object is identified in the recorded sensor data, it must be tracked over time. Object tracking may take many forms, depending on the application. The simplest forms of tracking simply follow the motion of the region representing the OoI in the input data, such as 2-D region-of-interest tracking algorithms for computer vision (e.g., Optical Flow, [29]). These methods can be extended to determine the position of the object in some real-world coordinate system by adding additional constraints to the above tracked positions. For example, a priori known object shape parameters are used in [30] to track the 3-D world coordinates of multiple geometric primitives composing a more complex overall model. Such methods tend to be inapplicable to TVG objects, due to their varying appearance, so generalized triangulation-based methods (e.g., [31]) were later proposed to address this issue. Full 6-dof tracking can also be performed by adding a priori knowledge of the object's nature (e.g., known feature points, [32]), or by using generalized descriptors (e.g., [33]). As with OoI identification, these methods form a distinct research area, wherein it is typically assumed that the input data is fixed (i.e., the sensors are static). Active-vision systems require a method of OoI tracking, but most existing methods cannot make use of the active aspect of such sensing systems.
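
The generalized triangulation mentioned above can be illustrated with a standard linear (direct linear transformation) triangulation of a single point matched between two calibrated cameras. This is a textbook sketch, not the de-projection and 3-D solver formulation developed in later chapters; the 3x4 projection matrices P1 and P2 are assumed known from calibration.

    import numpy as np

    def triangulate_point(P1, P2, x1, x2):
        # Linear (DLT) triangulation of one matched image point pair.
        # P1, P2: 3x4 camera projection matrices; x1, x2: (u, v) pixel coordinates.
        A = np.array([
            x1[0] * P1[2] - P1[0],
            x1[1] * P1[2] - P1[1],
            x2[0] * P2[2] - P2[0],
            x2[1] * P2[2] - P2[1],
        ])
        _, _, Vt = np.linalg.svd(A)
        X = Vt[-1]               # homogeneous world point
        return X[:3] / X[3]      # Euclidean (x, y, z) world coordinates

Applying such a triangulation to each tracked feature point at every instant yields a 3-D point trajectory in world coordinates, provided the same physical point can be matched in at least two views.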

1.3.3 Object Recognition

The next logical step, given that a system is able to identify and track a TVG object, is to recognize the object from a library of alternatives. Many established methods exist for identifying and classifying fixed-geometry objects [34]. These methods have been applied to face-based identification of humans (e.g., [35]), shape identification [36], and other varied tasks. The proposed solutions typically range in complexity from simple, 2-D image-based methods (e.g., direct image-based Principal Component Analysis, or PCA, [37]), to complex 3-D mesh-model methods which compare recovered 3-D polygonal meshes (e.g., [38]). These methods inherently assume that the OoI's appearance is fixed, invalidating their use for sensing TVG objects.

As such, TVG object recognition is a complex task. Any proposed method must be able to identify the OoI, regardless of its current form. Initial methods extended fixed-geometry methodologies by mixing the subject and action libraries. In [39], combinations of key subject-form snapshots from multiple actions are represented as distinct entries in an action library. This had the negative effect of greatly increasing the library size necessary to recognize multiple objects. Modern methods typically recognize the object as a side effect of action recognition. If the system can successfully extract the information needed to recognize the OoI action, and an action is positively recognized, the OoI's identity is inherently assumed to be known (i.e., inductive reasoning).

This concept can be extended for some types of TVG objects, as it may be possible to recognize a sub-class of TVG objects by their actions alone. For example, one of the earlier human-gait recognition works advocated that it might be possible to uniquely identify an individual based on their gait (walking motion) [40]. Research in this area began with algorithms designed to distinguish the current form of a human given a single image [41]. Using key-point markers, it has been shown that the gait of an individual can be uniquely distinguished at a rate above that of random chance [42]. The research results reported in [43] showed that automatic face and gait recognition can be combined, using decision-level data fusion, for human identification. However, as with OoI identification and tracking, these works still consider sensor poses to be an unchangeable constraint.

1.3.4 Action Recognition

The penultimate task for this research field is the complete recognition of the actions of the OoI by an automated system. The exact representation of what constitutes an action strongly depends on the application. A TVG object may perform more than one action simultaneously [44]. In some cases, it is beneficial to attempt to recognize the atomic actions in this situation, and in others it is simpler to recognize distinct combinations. Regardless of this representation, the basic goal of TVG object action recognition is to positively identify a specific, continuous, and finite sequence of OoI forms from the continuous stream of OoI forms. Specifically, many TVG objects exhibit repeatable, or even periodic, sequences of reconfiguration motion that one might wish to recognize [45]. According to [46], one can classify approaches for recognizing these actions into three general categories: template matching, semantic approaches, and statistical approaches.

Template-Based Action Recognition

Template matching can be seen as the simplest approach to motion recognition. Input images are compared directly to stored templates, and multiple key forms or images form a sequence template, which can be stored in a library. A metric, such as the Hamming distance (e.g., [47]) or the normalized Hamming distance (e.g., [48]), can be used to establish a positive match from the image data to one of the known templates. Modern silhouette-based methods are also known to fall within this category. For example, in [49], simple template matching is performed using a database of views captured from multiple cameras. The Chamfer distance is used as the distance transform in this work, and it has been shown to work in the presence of a cluttered background using a single camera with no occlusions. As an extension, [46] proposes the use of a simple template-matching method as an initial criterion. Background subtraction and thresholding by distance are first used to produce a silhouette. After classification is performed to reduce the search area, the method uses high-level auto-correlation, or HLAC, to extract a set of pose-invariant features to match to a database.
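
A minimal sketch of such template matching, assuming binary silhouette images of a common size and a library that stores one silhouette sequence per action, is given below; the normalized Hamming distance (the fraction of disagreeing pixels) is used as the match metric, purely to make the idea concrete.

    import numpy as np

    def hamming_distance(silhouette, template):
        # Normalized Hamming distance between two equally sized binary images:
        # the fraction of pixels on which they disagree.
        return np.count_nonzero(silhouette != template) / silhouette.size

    def match_action(observed_sequence, template_library):
        # Score each library action by the mean per-frame distance to the
        # observed silhouette sequence and return the closest match.
        best_action, best_score = None, float("inf")
        for action_name, template_sequence in template_library.items():
            n = min(len(observed_sequence), len(template_sequence))
            score = np.mean([hamming_distance(observed_sequence[i], template_sequence[i])
                             for i in range(n)])
            if score < best_score:
                best_action, best_score = action_name, score
        return best_action, best_score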

Semantic Action Recognition

Since matching templates or silhouettes of objects can prove to be computationally expensive, research has also focused on semantic approaches. These can be seen as analogous to template matching, except that the data used in the templates is high-level object configuration data, such as the position and angle of joints (if present). In essence, these are model-based approaches, in that a high-level representation of the OoI may be constructed (or merely represented). According to [46], this method is most applicable to inherently complex objects, such as artificial objects, or the human form. Indeed, the majority of past action-recognition methodologies have avoided these approaches due to this fact. As an example, [50] presents a model-based method for recognition of human gait. The geometry of a human body (specifically, the leg joints involved in walking) is recovered over several frames, and in the presence of occlusions. The paper concludes that it is possible for a feature-based method such as this to be used for gait recognition, but it will be subject to noise during the extraction of features. Signature size, a key factor for a recognition database, is also shown to be significantly reduced.

Statistical Action Recognition

The final approach type, statistical approaches, can be seen as an extension of both previous types, in that they attempt to reduce the dimensionality of the matching through statistical operations on the template database. For instance, in [51], the authors seek to identify what particular information is most important in identifying a positive match to a template. Using analysis of variance (ANOVA), according to [51], helps to identify features that highlight differences between subjects. Then, using PCA, they are able to reduce the data set to a lower dimensionality for matching. More complex methods may be used in place of PCA for matching high-level features, such as Fourier Descriptors [52], Linear Discriminant Analysis (LDA) [53], and Coupled Subspace Analysis (CSA) [54].
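
The dimensionality reduction that underlies these statistical approaches can be sketched with a basic PCA computed from the singular value decomposition of the centered template data; the choice of feature vectors, the number of retained components, and the nearest-neighbour matching that would follow are assumptions made only for illustration.

    import numpy as np

    def fit_pca(feature_vectors, n_components):
        # Compute a PCA basis from the template database (one feature vector per row).
        X = np.asarray(feature_vectors, dtype=float)
        mean = X.mean(axis=0)
        # Right singular vectors of the centered data are the principal axes.
        _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
        components = Vt[:n_components]
        reduced_templates = (X - mean) @ components.T
        return mean, components, reduced_templates

    def project(x, mean, components):
        # Map a newly observed feature vector into the same reduced space
        # before matching it against the reduced templates.
        return (np.asarray(x, dtype=float) - mean) @ components.T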

1.4 Sensing-System Reconfiguration

Given the available methods to recognize a TVG object and its actions, the natural question is: how might one increase the performance of the sensing task further? In general, the above methods are limited in performance for generalized TVG object sensing by the real-world effects identified in Sections (1.3.1) and (1.3.2). Some methods (e.g., [55]) have achieved performance that is superior to generic methods by focusing on a narrower sub-problem, but no single, unified method currently exists that can offer acceptable performance in all applications. For example, most current methods can be easily out-performed by a human observer.

Many methods have been proposed to improve sensing-task performance. Computer vision research has focused on improving the robustness of the above algorithms to the real-world sensing issues identified early in Section (1.3). For example, action-recognition methods which are robust to temporary occlusions (e.g., [56]), form-extraction noise (e.g., [57]), and scene clutter (e.g., [58]) have been developed. However, such methods are inherently limited in their ability to improve sensing-task performance, as there is a finite amount of useful data that can be extracted from a given set of input data. At some point, placing additional effort into this process will simply not increase performance further. As such, it has been proposed that it may be beneficial to first focus on improving the input data available to a sensing system [59]. In this manner, one may reduce the amount of sensing effort required to achieve a given level of performance, and possibly expand the upper bound on performance.

One common method which has been proposed to improve sensing data is termed sensing-system reconfiguration (or sensor planning in some literature). Sensing-system reconfiguration is defined as the selection, through a formal method, of the number, types, locations, and internal parameters of the sensors employed in the surveillance of an object or subject [59], [60], [61]. By examining the environment and selecting desirable sensor parameters, the system can reduce the uncertainty inherent in the sensing process, and thereby improve the performance of the surveillance task. The focus of this work is sensing-system reconfiguration, restricted to the on-line modification of sensor poses (position and orientation) in a multi-camera active-vision system, for the real-time surveillance of time-varying-geometry targets (such as humans) in dynamic, cluttered environments. In general, reconfiguration methods may be either off-line or on-line, although a complete sensing solution will typically require both types.

For completeness, the other most common method of improving input sensor data is to use additional, static cameras. This introduces a trade-off between the incremental cost of an additional camera and the additional benefit to the given sensing system. Given the number of variables that determine performance, it is typically not possible to directly characterize this trade-off at design time. As such, it may be difficult for a system designer to select an appropriate number of sensors for a given task without trial and error. The cost of an additional camera may also be difficult to quantify (e.g., in in-body surgical applications and robotic surgery). In general, there will be situations where the number of static cameras needed to achieve a desired level of performance is impractical when compared to a smaller number of active cameras. Other light-weight solutions, such as semi-random or pattern-based reconfiguration, have also been proposed, although results are often poor or sporadic [61].
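
To make the on-line pose-selection idea concrete before its formal treatment in Chapter 2, the sketch below outlines one reconfiguration cycle: predict the OoI and obstacle poses at the next demand instant, enumerate the poses each camera can reach within the available time, score every reachable pose with a visibility function, and command the best pose for each camera. The camera interface (reachable_poses, move_to), the predictors, and the visibility function are placeholder assumptions, not the agents defined by the framework proposed in this work.

    def plan_next_poses(cameras, predict_ooi, predict_obstacles, visibility, t_next):
        # One on-line reconfiguration cycle for a multi-camera active-vision system.
        ooi_pose = predict_ooi(t_next)              # predicted OoI pose at the next demand instant
        obstacle_poses = predict_obstacles(t_next)  # predicted obstacle poses at that instant
        assignments = {}
        for cam in cameras:
            # Only poses achievable within the time budget are considered.
            reachable = cam.reachable_poses(t_next)
            # Score each achievable pose by the predicted visibility of the OoI,
            # accounting for occlusions caused by the predicted obstacles.
            best_pose = max(reachable,
                            key=lambda pose: visibility(pose, ooi_pose, obstacle_poses))
            assignments[cam] = best_pose
        for cam, pose in assignments.items():
            cam.move_to(pose)                       # command the selected pose
        return assignments

In practice, such a loop must also respect hard timing constraints, since a pose decision that arrives after the demand instant it was planned for provides no benefit.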

35 10 later be moved (as in active vision), or will remain static. In the case of active vision, however, the methods can be expanded to also determine the movement capabilities of sensors in the system [62]. Although some of the earliest works in general sensing-system reconfiguration are primarily offline methods (e.g., [63]), research has continued in this area. The benefit of on-line reconfiguration is inherently limited by the capabilities of the sensing system [64], so determining a suitable off-line hardware configuration and initial sensor poses is still crucial for all modern sensing systems. However, it has been found that it is difficult to create a method which produces optimal configurations for a wide range of tasks [65]. Indeed, many modern off-line methods are highly application specific (e.g., [66] and [67]). There is also debate about what exactly constitutes an optimal configuration in this context [68], as overall sensing-task performance is highly dependent on many other factors. For the experimental set-up that will be presented in later chapters, the method used in [61] was implemented. This method is typical of a new generation of average-case methods, which relax the requirement for optimal configurations. Such methods attempt to determine hardware configurations which are expected to yield desirable average-case performance in a wide variety of sensing tasks, rather than optimal performance for a very narrow set of tasks. However, for the purpose of this dissertation, there is no one correct method. The focus of this work, and its novelty, lies in on-line reconfiguration of cameras in response to dynamic stimuli. Past research has also focused on on-line reconfiguration, although initial efforts are restricted to much simpler problems by necessity On-line, Static Environment, Fixed-Geometry Reconfiguration The earliest work in sensor planning, and sensing-system reconfiguration, focused on determining the configuration of a given set of sensors (with known capabilities) for a static environment and objects with a fixed (non-time-varying) appearance/configuration. In [59], most existing work was characterized as either generate-and-test, or synthesis. A generate-and-test method evaluates possible configurations with respect to task constraints, discretizing the domain to limit the number of configurations that must be considered. An established example of such a method is the HEAVEN system [69], which uses a discretized virtual sphere around the OoI to determine un-occluded, achievable poses for a single sensor. In [70], a similar method is used, except the OoI is also discretized, to determine the poses necessary to guarantee viewing of the entire object. Synthesis methods, on the other hand, characterize task requirements analytically, and determine sensor poses by finding a solution to the set of constraints presented. These systems are often application-specific, as in [62], where the planner synthesizes a region of viewpoints by imposing 3-D

positional bounds from task constraints. Other examples include [71], where points on the outer edge of an OoI form a virtual box, which the camera must be positioned to view, while minimizing the local distortion in the image. These methods, despite being developed early in the research process, are still used in more current methodologies. For example, in [60], sensor reconfiguration is used as part of an overall determination of sensing strategy (including number and type of sensors, as well as placement). The analysis involves determining face contact relations for all objects, and modeling these as constraints. Using these constraints, along with the basic motion constraints, a synthesis method determines the optimal pose through intersection. Another example is [63], where sensor planning is used for mapping a static environment. Solid models from an incremental modeler are used to compute future positions to allow sensors to better explore the environment. A synthesis method is used, which acts on constraints given by the contiguous volumes of unexplored environment area. Finally, in [66], an agent-based system is proposed to allow intelligent feedback between multiple sensors, reducing redundancy in the data and improving the surface coverage of the OoI. The scene is static, and individual sensors use a generate-and-test method to discretize and evaluate potential viewing positions. In general, sensor planning in static environments has evolved into mostly application-specific planning algorithms, designed to determine long-term placement for fixed sensors.

Dynamic Environment, Fixed-Geometry Reconfiguration

Sensor planning for dynamic cameras surveying an otherwise static scene has become a primary interest of computer vision specialists studying image or scene understanding [72]. It is a direct extension of the static environment sensing problem examined in Section (1.4.2), wherein the only moving objects in the environment are the sensors themselves. Many modern examples of such systems exist, as they are often used in computer graphics and other similar applications (e.g., [73], [74]). In [75], for example, a single dynamic camera recognizes a static 3-D object by on-line selection of poses which maximize dissimilarity among the set of candidate objects. Such works introduce the NBV problem, where an algorithm attempts to recover the maximum possible amount of unknown information about a subject with each movement of the sensor system [24]. However, none of these works directly address TVG objects; as mentioned above, the issue is that actions by the target introduce non-uniform importance of viewpoints over time. Namely, in maximizing the amount of unknown information recovered by the system, one must consider that choices made at past instants,

plus the current action of the target, will now affect the maximum possible amount of unknown information that is available [76]. Rather than immediately address this complex problem, this research focused on increasingly dynamic environments which still contain only a single, fixed-geometry object. A natural extension to the early static environment sensing problem was the addition of moving objects, obstacles, and sensors, as well as the possibility for continuous reconfiguration of sensor poses. For example, in [77], an 11-camera system was used to examine the effects of viewpoint on recognition rates for human gait. It was determined that the traditional static, off-line camera placement (with the camera at 90° to the expected walking direction) used in single-camera systems such as [46] will lead to poor performance for many sensor and OoI pose combinations. It was suggested that the generation of an on-line configuration for multiple cameras using data fusion could vastly improve upon single-camera algorithms. As a result, work began on algorithms to assist existing computer vision algorithms in dealing with a moving target. For example, dispatching can be used in a single-target, un-occluded environment problem to select and position a group of sensors in real-time, without requiring any previous knowledge of the target's motion [78]. It is also interesting to consider the case where the target is not just moving, but maneuvering; the target may be evasive or possess movement capabilities superior to the dynamic sensors. In [79], a set of mobile robots are positioned to maintain visibility of one or more evasive targets by maximizing the time it would take a target to escape from view (shortest distance to escape), given a known set of environment obstacles. Other early methods, such as [80], discretize the workspace into sectors, and assign sensors once an object enters a given sector. It is also necessary for a sensing system to intelligently address obstacles (and by extension, occlusions) in the environment. In a given sensing environment, there may be multiple static or dynamic obstacles present that impede optimal viewing of the OoI. For example, early systems such as [81] provide an agent-based sensing method, wherein each sensor path is determined independently by using triangulation to avoid obstacles. A similar method, proposed in [82], uses negotiation between these agents to maximize the amount of a target that can be observed at a given point in time. Other works, such as [83], addressed the case of multiple static obstacles using an agent-based approach. These works often address static obstacles via multiple mobile sensors, as in [79], where on-line sensor positioning allows for the surveillance of maneuvering targets. Dynamic obstacles require a detection step (assuming no a priori knowledge) and path prediction, as in [78], where an on-line method selects poses which maximize the expected visibility of a subject over the span of a rolling horizon in the presence of multiple, dynamic obstacles. Multiple dynamic obstacles

have also been considered in [83]. In the case of a fully dynamic, multi-obstacle environment, where the obstacles themselves may also be moving or maneuvering, more advanced methods are often necessary. For example, [84] presents a method for controlling a team of mobile robots equipped with cameras (active sensors) to optimize the quality of the vision estimates. The problem is addressed mainly as a simple optimization problem, with consideration to the next-best-view (NBV) problem. In this system, the next view is chosen, given previous scans, to best capture the remaining un-viewed surface geometry of the object. All of the above methods inherently assume a uniform target; recent methods have considered that a static, articulated or otherwise concave object may self-occlude, and as such, the next-best-view problem might be solved on-line as part of the sensing solution [85]. However, these methods still do not address TVG objects.

Multi-Target, Dynamic Environment, Fixed-Geometry Reconfiguration

The most general case of the fixed-geometry sensing problem includes a provision for multiple target objects. Some works address multiple dynamic OoIs by using attention-based behavior (e.g., [86] and [87]), where the 'attention' of the system is focused on a single target until the vision task is performed. This allows the sensing system to essentially reduce the sensing problem to a single-target task. With each iteration of the reconfiguration algorithm, the system will focus its attention on a single target, and mark all other objects as obstacles (e.g., [88]). Other methods propose a more dynamic solution, where attention can be divided between multiple targets if it is beneficial in reducing the time it takes to sense all targets completely (e.g., [89]). In another example, [90], an agent-based system, using attention-based behavior, is able to intelligently select targets under a set of global rules to ensure all targets are 'serviced' within the time requirements given. However, in all of these algorithms, each target must be serviced for only a finite amount of time before sensing is complete, and the system can effectively ignore that object. As identified previously, TVG objects exhibit actions which are continuous in nature; they must be constantly sensed, otherwise data is lost. As such, attention-based methods are not applicable to the recognition of TVG objects and actions, as a target cannot be sensed and forgotten. Due to this need for continuous sensing, multi-target active vision for TVG objects is considered to be a distinct problem from multi-target, fixed-geometry sensing and from single-target TVG sensing. The continuous surveillance of multiple human subjects (potentially, with other occluding subjects that are not of interest) requires a solution to a resource distribution problem which is outside the scope of this work. Not only must the limited sensing resources of the system be allocated continuously to maintain surveillance of all subjects of interest, but the list of subjects and/or the

sensing goals may need to be adapted on-line to suit the availability of resources and the current environment state. To address this problem, as for fixed-geometry objects, it is first necessary (in this work) to examine a suitable solution to the single-subject problem before adding resource assignment to the task at hand. A more detailed discussion of this future work is presented in Chapter 7.

Static Environment, Single TVG Target Reconfiguration

The next logical extension of this sensing reconfiguration problem is the on-line reconfiguration of a system sensing a single TVG object and its actions in a static environment. Early research considered only a static environment and a single target to reduce the complexity of the task at hand. However, very few algorithms exist that address even this reduced problem. Of these algorithms, some attempt to apply past research in sensing fixed-geometry objects directly to TVG objects. A rare example is [91], where a rolling time horizon is used with a constrained, decision-time-aware search method, allowing hard deadlines to be enforced during real-time, on-line static environment reconfiguration. The reconfiguration method itself, in this case, assumes a uniform representation of the OoI, rather than a true, articulated model. The remainder of these static environment methods often make significant assumptions of known information about the sensing task. For example, in [92], a multi-camera system is used to survey walking humans moving along a priori unknown paths. In this case, the OoI action, the locations of all obstacles, and the environment itself, are assumed to be known. This significantly simplifies the sensing problem, allowing an algorithm to more easily achieve optimal or near-optimal sensing-task performance. However, no algorithm currently exists which addresses real-time, on-line selection of sensor poses for single-target TVG action recognition in a static environment.

Dynamic Environment, Single TVG Target Reconfiguration

The final extension of the reconfiguration problem, and the focus of this research, is the consideration of a single-target, dynamic environment. This problem re-introduces all of the real-world sensing issues which a system must address in a real-world application. To achieve the best possible sensing-task performance in any single-target environment, a system must directly address as many real-world sensing issues as possible [76]. To yield a tenable problem, this work will be restricted to single-target environments. While multi-target environments will eventually be considered (as discussed in Chapter 7), they are outside the scope of this research, considering that no current method even exists for single-target action recognition and reconfiguration. Given the above literature, the overall problem at hand is now clear: there is a clear division in past work. Most existing methods for TVG object and action recognition assume input data (as

determined by sensing system configuration) to be fixed, with no opportunity for improvement. Sensing-system reconfiguration, on the other hand, has previously been applied to fixed-geometry tasks to tangibly improve sensing-task performance, but no current method exists which directly addresses the issues inherent to TVG objects and their actions. Thus, the focus of this work will be the development of a method of sensing-system reconfiguration that is applicable to a wide range of TVG object and action recognition tasks.

Applications: Human Action Sensing

Before defining the research objectives, it is beneficial to examine the depth of the TVG object and action recognition tasks mentioned above. In particular, this section will examine one of the most common application areas for TVG action recognition: human action sensing. As identified in the introduction, humans exhibit a wide variety of actions that one might wish to recognize. Humans use actions to convey information in and out of conversation, and to interact with their environment. As such, significant research effort has been expended to develop systems to recognize all manner of human actions (e.g., [93], [94], and [95]). This work will focus on three subsets of human actions which constitute a balance of typical human sensing tasks.

Body Motion and Gait

In past computer vision research, gait recognition, or the recognition of the distinct sequences of movement a human may perform, has been a key topic of interest; one of the most common reasons is for biometric identification of individuals. As mentioned above, in the early 1970s, it was proposed by [40], [42] that it might be possible to uniquely identify an individual based on their gait (in this case referring mainly to a walking motion). In addition to biometrics, however, the recognition of individual human actions may also be useful in communication. For example, [96] presents a method using automated segmentation and Markov chains to identify and classify several distinct motions from a given sequence. Recognizing whole-body actions can be used to facilitate human-machine interaction, as in [97], where a human's commands are detected based on body language. Body motion can also be used as a machine learning tool, as in [98], where a humanoid robot uses a vision-based approach to mimic a human teacher's actions.

Hand Gestures

Given that sensing-system reconfiguration could be applied to whole-body motion, it is logical to extend the problem to another important communication method: hand gestures. From [99], it is known that the key factors for a successful application of reconfiguration methods are present. Hand

gesture recognition typically is highly dependent on clear, un-occluded views of the subject's hands, and the best viewpoints are strongly affected by the particular gestures one wishes to recognize [99]. Hand gestures are also shown to suffer from greatly increased self-occlusion, and are also more prone to being occluded by the scene or other parts of the human OoI, due to their relative size [99]. Many current approaches use phonemes, or small 'syllable-like' poses that can be individually recognized [100], to simplify the problem. If one were to minimize self-occlusion, then more powerful model-based approaches would be viable [99]. Thus, there is potential for the application of sensor reconfiguration here, as well.

Facial Expressions

Extensive work has been performed in the field of facial recognition, for the purpose of identifying human subjects. However, as identified in [101], there is a wealth of information for interaction contained within facial expressions; as much as 55% of a face-to-face message may be conveyed by one's expression. In addition, it is known from [102] that expression recognition is sensitive to the same factors as face recognition: occlusions, viewing angle, image quality, etc. However, [102] also mentions that TVG-specific factors also affect recognition performance, albeit in a different manner from hand gesture and body motion recognition. From the previous literature review, it is obvious that sensor reconfiguration could be applied here, although it will later be shown (in Chapter 2) that this application belongs to a related, but subtly different problem sub-class from hand gesture, body motion, or gait recognition.

1.5 Research Tasks

The primary goal of this research is the development of a novel, generic method of sensing-system reconfiguration, which is specifically designed to sense TVG objects and actions in a real-world, cluttered, dynamic, single-target environment, in the presence of multiple static or dynamic obstacles, and in real-time. More generally, the objective is to demonstrate that by improving input data through sensing-system reconfiguration, one can reduce the effort required to achieve a given level of sensing-task performance, and potentially increase the maximum possible level of performance, as well. This goal has been achieved in the past for fixed-geometry objects, but it remains an open question if these conclusions are also applicable to TVG objects. The tasks set to achieve the research objective in terms of recognizing the actions of TVG objects are:

1. Recognize a single action of a single TVG object in a real-world environment containing multiple static or dynamic obstacles.

2. Verify that sensing-system reconfiguration can tangibly improve action-sensing performance, given the sensing task in Task (1).

3. Recognize multiple actions, including multiple simultaneous actions and multiple actions occurring at differing levels of detail, performed by a single TVG object in a real-world environment containing multiple static or dynamic obstacles.

4. Verify that any conclusions derived from Task (1) are still valid for the system and environment developed for Task (3). In particular, the research should confirm a tangible increase in sensing-task performance over static cameras for both sensing tasks.

5. Recognize single and multiple actions of multiple types of TVG objects in a real-world, multi-obstacle, cluttered environment in real time.

6. Verify that the real-time sensing system proposed to achieve Task (5) offers the same tangible increase in sensing-task performance as previous methods, and characterize this real-time performance.

7. Provide a complete, generalized framework that system designers can use to develop and evaluate sensing-system reconfiguration solutions applicable to a wide range of target tasks, environments, and sensing hardware. The methodology should be as comprehensive as possible, allowing one to develop complete sensing solutions tailored specifically to sensing TVG objects and their actions.

8. Demonstrate, through a completed application of the framework developed to achieve Task (7), that implementations of the proposed method still meet all other research objectives.

Together, these research tasks define the scope and depth of the work to follow. It is important to note that the focus of this dissertation is the novel sensing-system reconfiguration methodology that is presented first in Chapters 2 and 3 (Tasks 7 and 8), and the detailed experiments and validation in subsequent Chapters 4, 5, and 6 (Tasks 1 through 6). Although significant additional information is necessary for a complete framework (see Appendices), the core optimization and method of selecting sensor poses is always the main contribution of this work.

1.6 Summary of Contributions

It is also important to clearly define the central contributions of this work. As stated in the previous section, the principal contribution of this work is the novel method of sensing-system reconfiguration which is specially targeted at sensing a single TVG object. The following Table 1.1 summarizes the key contributions of this research and categorizes them as principal, major, or minor contributions.

TABLE 1.1 SUMMARY OF RESEARCH CONTRIBUTIONS

Task: Detection
Description of Task: All objects in the workspace must be detected and categorized as either the subject, or an obstacle, upon entering the workspace. Objects may enter or leave the workspace at any time. The work will assume a single subject. However, multiple obstacles may be present.
Expected Research Contribution: Minor

Task: Tracking and Prediction
Description of Task: Each object within the scene must be tracked to provide future estimates of pose whenever necessary. Additionally, if the subject is detected as performing any action (known or unknown), the system must be able to provide predictions of future subject forms.
Expected Research Contribution: Minor

Task: Reconfiguration
Description of Task: Given historical, current, and predicted data about the OoI and any obstacles, an achievable set of poses for all sensors must be selected. These poses should be selected by the system to globally maximize performance of the recognition task over the span of a rolling horizon.
Expected Research Contribution: Principal

Task: Recognition
Description of Task: Data from all sensors must be fused into a single estimate of the object's current geometry. A further estimate must reconcile this geometry data with historical data to determine the current action of the target.
Expected Research Contribution: Major

Task: Real-time Operation
Description of Task: All operations must be limited in computational complexity and depth, such that real-time operation is not compromised. The methodology should be designed with real-world implementation in mind.
Expected Research Contribution: Major

Task: Robustness
Description of Task: The system must be robust to faults, and the likelihood of false identification or classification must be minimized. It must be able to operate under real-world conditions.
Expected Research Contribution: Major

The first two contributions of this work, in detection, tracking, and prediction, are labeled as minor, as well-developed methods are used within the proposed framework. As such, the key contribution is to demonstrate that existing methods (with some modification) can be used within this framework to successfully detect, track, and predict the poses of TVG objects and obstacles. In Chapter 5, a novel method of OoI detection, based on interest filters, is also discussed as an application of this system. For recognition, the key contribution of this research is a novel method of multi-camera data fusion designed specifically for TVG object and action recognition. This method is designed specifically to leverage the multi-camera, active nature of the proposed methodology. It is also designed to operate in a real-world, real-time environment. Although similar methods exist for fixed-camera and/or fixed-geometry objects, no method exists which specifically addresses the environment considered in this research, making the proposed method a major contribution of the work. Real-time operation is also a major contribution of this research. Previous methods of TVG object action recognition and fixed-geometry sensing-system reconfiguration often consider real-time, real-world environments in a secondary manner. This work presents a customizable methodology that is designed from the beginning with real-time operation in mind. The concept of real-time operation and

the overall characterization of system performance and task success is also formalized, itself an important contribution. Real-world operation and robustness also mark an important contribution of this research. Past methods have addressed real-world operation in a secondary manner, often through informal testing. The proposed novel framework is designed to be customizable to a wide range of real-world applications, and is created specifically with real-world operation in mind. Chapters 5 and 6 present real-world trials involving varied single and multi-action sensing tasks which are then evaluated using a formal and repeatable methodology. These trials mark two important contributions of this work: a novel, real-world method of reconfiguration for TVG objects and actions, and a novel, formal method of testing and evaluating such systems. Finally, as mentioned above, the principal contribution of this work is the novel method of sensing-system reconfiguration. At its heart is the real-world performance optimization formulation, to be discussed in Chapters 2 and 3, which formulates the problem at hand as a mathematical optimization which can be implemented and solved by an automated system. This problem formulation, and the customizable framework designed around it, allows one to implement sensing-system reconfiguration specifically for TVG action sensing. As identified earlier in this Chapter, there are currently no other methods available that address this problem. The closest equivalents are either fixed-geometry methods of reconfiguration applied to TVG objects, or fixed-camera TVG sensing methods. As such, this contribution must be highlighted as the central novelty of this work.

1.7 Dissertation Overview

This dissertation presents a novel method of sensing-system reconfiguration designed specifically for systems sensing a single TVG object and its actions in a cluttered, real-world environment. In accordance with the research tasks outlined in Section (1.5), this dissertation has been divided into chapters based on logical divisions in the listed goals. In particular, the goals define a series of increasingly difficult sensing problems and subsequent verification tasks, which lend themselves to an iterative design process. As such, this dissertation will present three related, novel methods of system reconfiguration for TVG objects, each tailored specifically to meet one or more of the research tasks above. These methods are directly related, as each method will be created through iterative improvement of a previous method. For the first method, past work on fixed-geometry objects will be used as a baseline to begin this iterative process. It is important to note, however, that all methods being presented are closely related, and all methods are novel. Furthermore, even though each subsequent design iteration can effectively replace its predecessor method, all proposed

methodologies will be designed to remain useful for situations where the additional complexity of the new method is not needed by the system designer. As such, this dissertation is organized as follows:

Chapter 1: This Chapter presents a detailed overview of the fields of TVG action sensing and sensing-system reconfiguration, and introduces the core problem of this work.

Chapter 2: The second Chapter will develop the problem of sensing-system reconfiguration for TVG objects and actions into a mathematical formulation which can be directly solved by a sensing system to achieve pose selection. A number of key concepts will be defined in this chapter, including the core pose-selection optimization, the articulated object representation, and the visibility metric and its relation to sensing-task performance.

Chapter 3: As a continuation of Chapter 2, this Chapter will present a complete, generalized, and customizable novel framework for real-time sensing-system reconfiguration. This will include a description of the chosen real-time architecture, as well as detailed descriptions of the functionality and theory for each sub-part. The Chapter will also present a detailed guide for customization of the proposed methodology to a chosen application.

Chapter 4: Given the complete proposed methodology, this Chapter defines a simplified methodology (the immediate successor of past fixed-geometry methods) which will perform single-action TVG recognition. This novel methodology is developed through a detailed simulation environment and simulation-based experiments. The basic assumptions of sensing-system reconfiguration are evaluated through rigorous testing in a simulated environment.

Chapter 5: Using the previous methodology, Chapter 5 presents an iteratively re-designed methodology for multi-action and multi-level action recognition, which implements improvements and changes identified in Chapter 4. In this case, real-world, quasi-static experiments are also presented to evaluate the proposed methodology, compare conclusions to past results, and identify further areas for improvement.

Chapter 6: This Chapter will revisit the customizable, generalized framework presented in Chapters 2 and 3. It will focus on the iterative re-design of the methodology from Chapter 5 to create this novel framework, and on validation and characterization of the method through real-world, real-time experiments.

Chapter 7: The final Chapter will highlight the contributions of this work, and will also include recommendations for areas of future research in the field.

2. Problem Definition

Given the past research reviewed in Chapter 1, it is pertinent to formally define the task at hand. As identified in the literature review, human action sensing is a widely-varied subject area. Numerous methods have been developed to track subjects, and recognize humans and their actions using static and dynamic sensors. This task can be complicated by additional factors, such as a time-varying geometry (TVG) object/subject, or a dynamic environment, to yield a more difficult sensing problem. Performance-enhancement techniques were also identified; methods may be applied to the sensing problem at hand to improve the system's performance or success rate. One such method, sensing-system reconfiguration, formulates the sensing problem as an optimization, maximizing performance by varying sensor parameters and layout. However, it was identified that there is a core division in the literature. Modern action recognition techniques generally assume fixed cameras, fixed input data, and near-ideal conditions. Sensing-system reconfiguration techniques typically address fixed-geometry objects, or apply fixed-geometry methods to TVG objects, ignoring the factors specific to these objects. As such, the core problem addressed by this dissertation is to bring these two fields together, in order to develop a sensing-system reconfiguration methodology suitable for general TVG object action recognition. Inherently, this includes human-action recognition. With this goal in mind, this Chapter will begin with a qualitative examination of the two tasks at hand, Section (2.1). After identifying the key tasks, Section (2.2) and Section (2.3), a mathematical formulation of the problem will be presented, Sections (2.4) to (2.6). A baseline optimization process, suggested specifically for solving this problem within the framework to follow, will also be developed.

2.1 Overview

From the literature review, two distinct tasks were identified: (i) TVG object/subject action recognition, and (ii) Sensing-System Reconfiguration. This Chapter will begin by reviewing the representation of an action for a TVG object. The following Section (2.2) will outline the sub-tasks necessary for a system to recognize these actions.

TVG Object- and Subject-Action Representation

As identified in the literature review, the basic task at hand is the recognition of a single action performed over time by a TVG object. Typically, an action constitutes a case where an object moves or deforms as a continuous function of time. Describing an action mathematically inherently assumes

a choice of model representation. As will be shown later, this representation is able to be quantized in a manner that a sensing system can implement. An object may, in the most general case, be modeled using a volume equation, which defines the volume the object encompasses [103]. For a non-convex object, it is inherently difficult to express the volume equation in closed form. Thus, it is common to break non-convex objects into subparts which are convex [104]. This allows a representation given in Equations (2.1) and (2.2) below.

$\theta = \operatorname{atan2}(y - y_c, \; x - x_c), \quad \phi = \arccos\!\left( \frac{z - z_c}{\sqrt{(x - x_c)^2 + (y - y_c)^2 + (z - z_c)^2}} \right)$ (2.1)

$\sqrt{(x - x_c)^2 + (y - y_c)^2 + (z - z_c)^2} = d(\theta, \phi)$ (2.2)

In Equation (2.2), the left-hand side of the equation is the Euclidean distance of a point, with 3-D coordinates $(x, y, z)$, to the object's center (typically, the center of mass, or CoM), with coordinates $(x_c, y_c, z_c)$. Equation (2.1) defines the normal direction of the point, $(\theta, \phi)$. Thus, the right-hand side of Equation (2.2) is the Euclidean form of the surface equation of the object, or its external depth map, $d(\theta, \phi)$. These properties are shown graphically in Figure 2.1.

FIGURE 2.1 GRAPHICAL VIEW OF THE SURFACE EQUATION FOR A VOLUME

It is assumed that objects being sensed will be limited in their deformations to pure deformations of structure; a constant surface area assumption is imposed [105]. Surface area is conserved, and is neither created (i.e., through cutting), nor destroyed (i.e., through joining, although tangency is allowed). This is a common method of simplifying the problem at hand; in cases where this assumption is violated, additional reasoning about the scene above pure action recognition is likely to be necessary. As such, these cases are best modeled as a separate, but related, problem sub-class. Deformable objects tend to exhibit more applications where this assumption is invalid (i.e., cutting or machining of a part). Also note that this is a fine distinction when multiple objects are involved; for

example, if a human picks up an object, one may model it as a new, combined human/object hybrid (invalidating this assumption), or one may retain two separate object models. Furthermore, it is assumed that the sensors are limited to surface sensing, i.e., volumetric information about the object will not be available. Given these two assumptions, only the external surface area of the object is of interest and, thus, this work need only be concerned with the depth map of the object. While this most general case can represent any deformation of a convex object, it is too general to form an action-sensing system around. The general form of an action is given by Equation (2.3):

$d = d(\theta, \phi, t)$ (2.3)

The depth function above is extended as a continuous function of time, $t$. However, not all of the surface area will be in motion at once. Some sections may be constant over all actions the object can perform, i.e., rigid sections may exist [104]. Typically, an action is defined and identified by those parts which vary over time in a predictable manner. For example, a human walking motion has rigid segments (i.e., limbs, torso, etc.), and geometry which varies (i.e., the leg and arm joints, primarily). However, clothing and skin may deform as well, but not in a repetitive manner; these deformations of the object surface are not part of the action, and can be considered as noise when recognizing it. As such, two subclasses of the TVG action recognition problem can be defined: deformable object actions and articulated object actions [106]. An overview of the differences is shown in Figure 2.2.

FIGURE 2.2 OVERVIEW OF DIFFERENCES IN ARTICULATED AND DEFORMABLE OBJECTS

Let us consider two objects, each with an evenly-spaced mesh spread over the surface area of the object. An action could be characterized by the positional variance at each point in the mesh over the course of the action. In this manner, the division between the problem subclasses becomes clear. This work defines a deformable object as one where each action's variance is relatively evenly distributed over many or all of the points in this mesh. Examples include human facial expression recognition, clothing deformation recognition, recognition of heart movement during surgery, and other cases where much of the object's surface area is deforming. On the other hand, an articulated object is characterized by large, continuous areas which have little or no variance over actions (i.e., rigid sections). Often, these objects exhibit positional variance only about certain articulation points, or

joints. A classical example is the human form, which possesses rigid segments (the limbs, torso, etc.) and articulation points (joints). As shown in Figure 2.3, these objects are often represented using a joint-skeletal model, as shown here for a human arm.

FIGURE 2.3 COMPARISON OF OBJECT MODELS FOR A HUMAN ARM

Under the above model assumption, a new representation for the action can be formulated. This work assumes the rigid segments to remain constant in shape and size. Obviously, this assumption cannot be perfectly satisfied in reality; most rigid objects will exhibit deformation while in motion, regardless of other properties. Thus, the method being developed must be robust to the noise generated by minor deformations. The above assumption is, thus, relaxed to allow deformation of rigid segments which is (i) small in magnitude relative to the segment's size, (ii) not significantly correlated between actions, and (iii) low in variance across actions. In essence, the action can be completely represented by the vertices (joint positions) of the object. The connections (rigid segments) are taken to be constant, with additive measurement error to be addressed later. This new form is given by Equation (2.4):

$\mathbf{a}(t) = \left[ x_1(t), y_1(t), z_1(t), \ldots, x_n(t), y_n(t), z_n(t) \right]^T$ (2.4)

In Equation (2.4), $\mathbf{a}(t)$ is an action feature vector, which is an aggregate of the positions of all joints in the articulated model. The frame of reference depends on the application. The functions $x_1(t)$, $y_1(t)$, $z_1(t)$, and so on, are the time-varying 3-D $(x, y, z)$ positions of all joints in the model, for a model with $n$ joints. This action feature vector is a continuous function of time, and a single object may

perform multiple actions in sequence, or simultaneously. Actions are considered to be additive; multiple actions performed simultaneously are given as a direct addition [107], as in Equation (2.5):

$\mathbf{a}_C(t) = \mathbf{a}_1(t) + \mathbf{a}_2(t)$ (2.5)

In the above equation, two actions, $\mathbf{a}_1(t)$ and $\mathbf{a}_2(t)$, are fused to create a combined action, $\mathbf{a}_C(t)$. Recognition of multiple, simultaneous actions will be addressed later in this dissertation. It is important to note that this intuitive model does not imply complete linearity of actions; rather, it simply states mathematically that actions can be combined to form a new, fused simultaneous action. It is also important to note at this stage that no generality has been lost by using an articulated skeletal model. Indeed, a deformable object can be modeled in this manner using a simple polygonal mesh [108], as seen in Figure 2.3. The polygons themselves become the rigid sections, and the vertices are the joints. Obviously, this is not the most efficient representation for such objects. This is why they are treated as a separate problem subclass; while the method outlined in this dissertation can be applied, and the sensing problem at hand is similar, the best results would be available through a method specifically designed for such objects. The reasons for this are manifold. For one, the relative trade-offs between environmental factors (occlusions, relative viewing angle, etc.) tend to be very different. The off-line configuration of the sensing system (and, hence, the resultant camera/system capabilities) also tends to be different. Using an articulated representation for a mesh introduces a large number of dof, which need to be tracked and processed for sensor-pose choices, complicating real-time operation. Thus, this work will focus on sensing articulated TVG object actions.

Action-Recognition Task and Quantization

Using the representation of an action described above, it is possible to formulate a method to recognize an action from sensor data. At its core, this task will involve comparing a feature vector, measured from sensor data, to a library of known feature vectors. This is a non-trivial task, and many established methods exist which can perform this comparison (e.g., [109], [110], [111]). One issue which is common to all methods, however, is that sensor data is typically not continuous data; real-world sensors must quantize that which they measure, and they do so at a certain sampling rate. As defined in Equation (2.4), the positions of each joint in the model are continuous functions of time. Practically, one can only measure these values at discrete instants. The system designer must decide whether or not to move this problem to the discrete-time (DT) domain, or remain in the continuous-time (CT) domain. It is not possible to recommend one method for all applications, as there are benefits and costs to both approaches.
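To make the joint-based representation of Equations (2.4) and (2.5) concrete, a minimal Python/NumPy sketch is given below. It is illustrative only, and not part of the proposed framework: the array layout, the helper names (feature_vector, combine_actions), and the assumption that feature vectors are expressed as offsets from a reference (idle) pose, so that a direct sum is meaningful, are all choices made here for clarity.

    import numpy as np

    def feature_vector(joint_positions):
        # joint_positions: (n, 3) array of per-joint (x, y, z) values at one instant,
        # assumed here to be offsets from a reference (idle) pose.
        # Returns the stacked vector [x1, y1, z1, ..., xn, yn, zn] of Equation (2.4).
        return np.asarray(joint_positions, dtype=float).reshape(-1)

    def combine_actions(a1, a2):
        # Additive fusion of two simultaneous actions, as in Equation (2.5).
        a1 = np.asarray(a1, dtype=float)
        a2 = np.asarray(a2, dtype=float)
        if a1.shape != a2.shape:
            raise ValueError("both actions must use the same joint model")
        return a1 + a2

    # Example: a hypothetical 3-joint model, with a 'wave' action superimposed on a 'walk' action.
    walk = feature_vector([[0.0, 0.1, 0.0], [0.0, 0.0, 0.2], [0.1, 0.0, 0.0]])
    wave = feature_vector([[0.0, 0.0, 0.0], [0.0, 0.3, 0.1], [0.0, 0.0, 0.0]])
    combined = combine_actions(walk, wave)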

If the problem is to remain in the CT domain, DT sensor data must still be used by the system. It is important to note that a CT action library is inherently difficult to capture and store [112]. The joint locations, as continuous functions of time, often do not have a closed-form equation. For simple objects which move with highly predictable actions, it may be possible to express the action in closed form [113]. Other common applications include cases where the object/subject's actions are externally controlled or designed, such as a robotic arm in an assembly line (e.g., [114]). In these cases, the closed-form equations for the joint position functions, $x_i(t)$, $y_i(t)$, and $z_i(t)$, can be determined through regression of the DT sensing data. By keeping a CT representation, the action feature vector may accurately represent known theoretical properties about the object, and this representation is intuitively understandable by humans. Regression methods also have significant theoretical background, can be made robust to input noise, and may be less computationally costly than other methods. However, most real-world objects do not have closed-form equations to represent their actions. In some cases, it may be beneficial to move the problem to a frequency-domain representation using the Fourier Transform (FT). By taking the Discrete-Time Fourier Transform (DTFT) of the sensor data, it may be possible to fit a closed-form equation to the resultant frequency-domain data [115]. However, in both cases, regression is used to reconcile discrete input data with a continuous, closed-form equation. In the frequency domain, this is especially interesting due to the effects of sampling. For most real-world actions, the Fourier transform of any joint position function will yield a frequency-domain representation that tends to approach zero as frequency increases [115]. If a suitable upper cut-off frequency, $f_c$, is selected, above which the Fourier transform is taken to be zero-valued, then most or all of the information in the signal can be preserved. Furthermore, according to the Nyquist-Shannon sampling theorem, if the DT data sampled by the system is sampled at a rate $f_s$ such that $f_s \geq 2 f_c$, then no information is lost in the sampling [115]. Indeed, the only stage where information is lost is when the original frequency-domain representation has its bandwidth limited. If the original function naturally has zero value after a certain frequency, it is possible to move the problem to DT without information loss. DT sampling and problem formulation will be used in the proposed method, and the specifics will be developed in detail in Chapter 3. It is important to note that this process does introduce new issues, as well. Aside from any information lost due to bandwidth limiting, the system must be able to sample at a sufficiently high rate. If not, aliasing and other adverse sampling artifacts may occur. This representation is also less efficient for storage; large numbers of samples may be necessary. This means that some methods may also be more computationally costly than dealing with a simple, known CT equation. The benefits and costs of specific methods used in the proposed system will be discussed in Chapter 3. It is now useful

to examine the subtasks necessary to sense an action, given the representations and quantization schemes above.

2.2 TVG Action-Sensing Tasks

As defined in Chapter 1, the task at hand is the recognition of the actions of a generic TVG object in a real-world, real-time environment. To recognize a TVG object action, the sensors must continuously measure the time-varying feature vector of the object. One assumes that sensors will collect estimates only at discrete times, with these discrete feature vectors given by Equation (2.6):

$\mathbf{a}[k] = \mathbf{a}(t_k), \quad k = 0, 1, 2, \ldots$ (2.6)

These discrete instants must follow a formal sampling scheme which prevents aliasing through appropriate selection of the sampling instants. Thus, the list of tasks necessary to recognize a TVG object action is given as follows:

Detection: The object of interest (OoI) must be detected whenever it enters the workspace of the system. The OoI may enter or leave the workspace at any time, although its trajectory will provide clues to the system about when these events may occur. This is a non-trivial task, depending on the model of the object. In particular, the level of knowledge the system possesses of the object is critical in determining the difficulty of this task [116]. If the object has readily detectable features, a known shape, and/or relatively invariant features, it may be possible to use classical feature-point or object-detection methods, such as those outlined in Chapter 1. However, in the most general case, a method based on more generic object descriptors, such as a statistical method (e.g., [117]) or interest-filter method (e.g., [28]), may be necessary. This is often the case for TVG objects, as their overall appearance may change significantly over the course of an action, and their current state and action is assumed to be unknown when they enter the workspace. All other moving obstacles in the environment must be detected as well, to allow the system to start and maintain a track of their 6-dof poses. The system can also potentially include a detection step to learn if new objects are added to the environment (effectively an identical case to a known object entering the workspace from an area outside its bounds).

Tracking: The 6-dof pose of the OoI must be tracked continuously as it moves around the workspace to provide a frame of reference, and as a critical input to the sensing-system reconfiguration method [90]. Similarly, to produce an estimate of the feature vector for a given instant in time, a continuous track of all joint positions must be maintained. This dissertation will

refer to the 2-D and 3-D locations of the model joints as feature points, or interest points. Individual feature points must also undergo a detection step, which again may be non-trivial. The view of the object may be partially occluded, or partially outside the workspace, meaning some joints will be visible and others not [118]. Tracking loss must also be detected, to prevent erroneous data from entering the action recognition process [119].

Estimation: Given all detected and tracked feature points, the system must be able to reconstruct an estimate or snapshot of the feature vector of the object for the current time. This process involves several sub-steps. A sample-and-hold process is necessary to capture the initial data from the sensors. This process may be complicated if multiple sensors are used, as a time-alignment method is necessary to ensure that all sensor data represents the same instant in time [120]. If multiple sensors are used, data must be fused into a single, coherent estimate of the action feature vector. Producing this estimate may be complicated by missing feature points, outliers and noise, and incorrectly detected feature points [121]. It is also beneficial to the system to combine any known data or contextual information, such as model constraints, knowledge of the current action, and predictions of the feature vector, into the estimate [76]. The use of contextual information is often identified in the literature as the key factor which separates human cognition of a scene from automated methods.

Action Recognition: Recognizing an action is also a multi-step process. Estimates of the feature vector arrive in a continuous stream, collected at designated instants. From this stream, the system must first detect whether the object is idle or performing an action. Once performing an action, the object may (i) continue to perform the same action, (ii) return to an idle state, (iii) change the parameters of the action it is performing, and (iv) transition to a new action. This process is shown visually as a state machine in Figure 2.4.

FIGURE 2.4 STATE-MACHINE REPRESENTATION OF OBJECT ACTIONS

The conditions for transitions between states in this machine are not always clear-cut; in reality, the actions will tend to blend together [122]. One can define the transition to a new action to occur when either (i) the influence of the old and new actions is at 50% each, or (ii) the comparison metric falls outside an application-specific range for positive recognition of the old action. For the second case, where the new action is likely to be significantly different from the old action, the time between the two actions is modeled as an idle time. Even though an action is still being performed, the uncertainty is too high to positively classify it. Actions may also vary over their course, as in option (iii) in the state machine above. While some objects may exhibit perfectly repetitive motions, others may not. For example, human actions vary significantly each time they are performed, as well as between different subjects [38]. The system must be robust to these variations. The position functions that constitute an action feature vector may themselves vary between actions. Actions may be performed at different rates, lengthening or shortening the total time. After detecting a point of interest in the action stream (typically, the user-defined start or end of a distinct action), the system must also classify what action is being performed. Classification of an action requires an existing library of action feature vectors for comparison. The capture and storage of this library presents many challenges to the system designer [123]. Captured library data must contain minimal noise, and must accurately represent the action to be recognized. Representation format and library size are also important; a naive implementation may grow exponentially in size with new actions, and search time is proportionally related to library size.

One must also ensure that the library actions are representative of the average action. If an action is highly variable when performed by the object, it may be necessary to represent it as a family or spectrum of related actions in the library. Careful selection and division is necessary for best classification performance. To classify an action, the system must search this library and find the closest known action. In the most basic case, a decision is made by the following process, Table 2.2:

TABLE 2.2 CODE LISTING FOR ACTION CLASSIFICATION

[1] For all actions in the library, $\mathbf{A}_i$, $i = 1 \ldots N$, do:
[2]     $m_i = M(\mathbf{a}[0], \ldots, \mathbf{a}[k], \mathbf{A}_i)$
[3]     If $m_i < m_{max}$ then
[4]         Accept classification of action as $\mathbf{A}_i$.
[5]     Else
[6]         Withhold classification.
[7] End of loop.

In essence, the system must check each library action, $\mathbf{A}_i$, for a library with a total of $N$ actions, and evaluate a metric of comparison, $m_i$. This function, $M(\cdot)$, depends on the collected object feature vectors, $\mathbf{a}[0], \ldots, \mathbf{a}[k]$, and the library poses, $\mathbf{A}_i$. If the evaluated metric value, $m_i$, is below a maximum level, $m_{max}$, then the action is classified. Otherwise, a decision is withheld. This represents the most basic formulation of a classification method. The problem at hand is to implement this search in an intelligent manner; a naïve approach will yield long search times and poor recognition performance. Known information about estimates of the current action, or other contextual information, should be used to reduce the number of library comparisons whenever possible. The classification metric must be carefully chosen to maximize the inter-action variance in library space, and to minimize intra-action variance. More advanced tests than the binary yes/no test above may be implemented to further reduce the chances of false classification [124]. For all previous tasks, one must also consider the case of multiple, simultaneous actions. As defined in Equation (2.5), actions are considered to be additive. As such, one can always define any combination of two or more actions as a single, new library action. This approach is simple to implement, especially if the number of combined actions is expected to be limited. However, the library size can grow exponentially if all combinations of actions must be included. Generalized action classification through separation of component actions before classification requires a significantly more complex method.
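The listing in Table 2.2 can be made concrete with a short Python/NumPy sketch. This is a minimal illustration of the basic library search only, not the classification method developed later in this work: the function and variable names are hypothetical, the comparison metric M(.) is taken here to be a mean Euclidean distance between aligned feature-vector sequences, and issues such as time alignment, rate variation, and contextual pruning of the library are deliberately ignored.

    import numpy as np

    def classify_action(observed, library, m_max):
        # observed: (k, d) array of collected feature vectors a[0], ..., a[k-1].
        # library:  dict mapping an action name to a (k, d) reference sequence A_i,
        #           assumed here to be pre-aligned with the observed stream.
        # m_max:    application-specific acceptance threshold.
        # Returns the best-matching action name, or None to withhold classification.
        best_name, best_metric = None, np.inf
        for name, reference in library.items():
            # Comparison metric M(.): mean Euclidean distance between the sequences.
            m = np.mean(np.linalg.norm(observed - reference, axis=1))
            if m < best_metric:
                best_name, best_metric = name, m
        return best_name if best_metric < m_max else None

    # Example with a hypothetical two-action library of 4-sample, 6-dimensional sequences.
    rng = np.random.default_rng(0)
    library = {"walk": rng.normal(size=(4, 6)), "wave": rng.normal(size=(4, 6))}
    observed = library["walk"] + 0.05 * rng.normal(size=(4, 6))
    print(classify_action(observed, library, m_max=0.5))  # expected: "walk"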

Next-Best-View Problem and Sensing-System Reconfiguration Tasks

As was noted in the previous section, the generic TVG object-action recognition problem is a complex task, with many sub-problems. One common factor to all of the above, however, is that recognition performance is inherently tied to sensor data, even at the most general problem level [125]. The action library is constructed from data captured from sensors. The classification metric takes a collection of sensor data as input. Objects are tracked and detected from sensor data. Although clever system design and operation can mitigate some problems and generally improve performance, sensor data is the determining factor for all tasks. As such, it is desirable to present the best possible sensor data to the action recognition system first, providing it with the best possible chance of success. This will be the primary, novel contribution of this research: a sensing-system reconfiguration method designed specifically to improve TVG object action recognition performance. As for action recognition, reconfigurable sensing systems are also defined by a set of basic, component tasks which must be performed by the sensing system. This section will examine these tasks and how they are specifically related to TVG object action recognition.

General and TVG Object-Specific Issues

The core goal of sensing-system reconfiguration is to select camera parameters, both off-line and on-line, which maximize performance in the designated sensing task [59]. Off-line parameters include system set-up variables, such as the number of cameras, their placement, and their capabilities. On-line parameters include camera variables, such as focus, zoom, and 6-dof pose (position and orientation). The focus of this work will be the latter: on-line reconfiguration of poses for a system of mobile cameras. This work will assume that an off-line calibration method has been used to select an appropriate sensing setup for the task at hand. Typically, these methods accept a generalized description of the sensing task at hand (including the sensing goal, workspace constraints, and any a priori knowledge of the OoI) to maximize one or more system outputs, such as average performance or average visibility. Methods are also available to optimize off-line configuration in a human-assisted manner. As an example, the off-line method used in Chapter 3 accepts any a priori knowledge of the environment, including a bounding area for the OoI to move within, and outputs the number and static positions of cameras to yield the best average-case sensing performance. Real-time, real-world sensing-system reconfiguration relaxes the above problem to allow for near-optimal performance [90]. In the real world, future positions must be predicted from past and current data and, thus, are inherently uncertain. For real-time operation, hard deadlines on sensor pose

decisions are imposed, which may also result in near-optimal performance if the system does not have time to search the entire sensor-pose space for an optimal solution. At the core of the sensing-system reconfiguration process is the selection of performance-optimal sensor poses. The exact process to select these poses will strongly depend on the sensing application. In this case, the goal is to sense the action of a TVG object. These objects are inherently affected by the same sensing issues as any other object:

Occlusions: Various objects which are not of interest to the system may clutter the sensing environment. These obstacles may occlude the view of the OoI, reducing its visibility, and hence the task-performance of the system. Furthermore, these obstacles may be dynamic or maneuvering on a priori unknown paths. Such obstacles may be highly intrusive, and may partially or completely occlude the OoI from certain views. A reconfigurable sensing system must model, track, and address all obstacles possible in order to select the best possible sensor poses [61] (a minimal example of the kind of geometric visibility test involved is sketched after this list). This necessitates pose prediction and tracking for all obstacles modeled in the environment. The system requires past and future information about obstacles to make sensor-pose decisions, as this choice is time-dependent. The system must possess sufficient capability and time to allow sensors to move to un-occluded views.

Clutter: The environment may also be cluttered with multiple, static or dynamic objects which are not (or cannot be) explicitly modeled and tracked by the system. These objects may interfere with the sensing process by introducing noise or un-modeled occlusions into the sensor data. The sensing system must be robust to the noise created by a cluttered environment [86]. This applies both to the sensing tasks and the system reconfiguration tasks. Both inherently depend on information generated from sensor data, and as such are vulnerable to noise.

Moving or Maneuvering Target: The OoI may itself be moving or maneuvering on an a priori unknown path. Just as for any obstacles, this path must be tracked, recorded, and predicted for pose selection to occur [126]. The object may also be purposely evasive, and may possess greater mobility than the sensing system.

Environmental Variation: The sensing environment itself may change over time. This may include long-term variations in lighting or background objects. It may also include changes in the sensing task itself, such as new obstacles or target behavior. The system must be designed in a flexible manner, which is robust to long-term variations, and which can easily be extended and reconfigured to address changes in the modeled objects [127].
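As a concrete illustration of the occlusion issue above, the following Python/NumPy sketch shows one simple way a reconfiguration method might test whether a candidate camera position has an un-occluded line of sight to the predicted OoI position, with each obstacle approximated by a bounding sphere. This is an assumed, minimal geometric test for illustration only; it is not the visibility metric developed in this work, and the function name and sphere approximation are choices made here.

    import numpy as np

    def line_of_sight_clear(camera_pos, target_pos, obstacles):
        # Returns True if the straight segment from the camera to the target does not
        # pass through any obstacle, where each obstacle is given as (center, radius).
        cam = np.asarray(camera_pos, dtype=float)
        tgt = np.asarray(target_pos, dtype=float)
        seg = tgt - cam
        seg_len2 = float(np.dot(seg, seg))
        for center, radius in obstacles:
            c = np.asarray(center, dtype=float)
            # Closest point on the camera-target segment to the obstacle center.
            t = 0.0 if seg_len2 == 0.0 else float(np.clip(np.dot(c - cam, seg) / seg_len2, 0.0, 1.0))
            closest = cam + t * seg
            if np.linalg.norm(c - closest) <= radius:
                return False  # segment intersects the bounding sphere, i.e., the view is occluded
        return True

    # Example: one candidate camera pose scored against a single predicted obstacle position.
    print(line_of_sight_clear([0, 0, 2], [5, 0, 1], [([2.5, 0.1, 1.5], 0.5)]))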

In addition to these issues, which may occur when sensing almost any type of object, TVG objects exhibit problems unique to their nature. These include the following:

Non-Uniform Importance of Viewpoints: All viewpoints are inherently non-uniform in their importance to the sensing task. As shown above, obstacles and occlusions may greatly reduce the amount of useful information that a particular view can provide to the sensing system. However, fixed-geometry object-sensing systems assume that if two views have identical parameters (relative pose to the object, etc.), but are taken at two different times, they will still contain the same information. Specifically, such systems assume that the object does not change significantly in appearance over time [90]. The only time-varying object parameter which these systems address is the OoI's 6-dof pose. However, a TVG object may change significantly in appearance by performing actions. The same relative view of an object will not contain the same information at two different times in an action. Thus, the importance of each relative view changes significantly over time, and the sensing problem becomes a continuous sensing problem. Sense-and-forget solutions are no longer applicable.

Self-Occlusion: TVG objects, which may be deformable or articulated, often exhibit significant self-occlusion. Many fixed-geometry sensing methods model the object as a solid primitive, such as a sphere, cylinder, or cube (e.g., [83]). This cannot account for any occlusion caused by sub-parts of the object itself. In the case of highly articulated objects, such as humans, self-occlusion can significantly reduce the useful information available from otherwise un-occluded views [77].

Continuous Sensing: Both of the above issues are related; the importance of any relative view of the OoI is a strong function of time. However, as identified in the previous section, methods to recognize a TVG object action require a continuous, uninterrupted stream of action feature-vector estimates for optimal performance. While one goal is to improve the quality of any views provided by the sensing system, the most important goal is to ensure that surveillance of the object is never interrupted: the object must be sensed continually at all times. This requires real-time operation; all methods and implementations must be designed with real-time operation in mind [128]. This is not a trivial task, as real-time operation itself imposes strict time constraints on all operations.

Next-Best-View Problem

To formulate the above issues as a coherent, solvable sensing problem, the next-best-view (NBV) problem formulation will be used. This formulation states that with each subsequent reconfiguration of the sensors in a sensing system, the system should seek to maximize the amount of unique, useful information about the object that is uncovered [23]. In this manner, the sensing task will have the optimal chance of success. This formulation inherently incorporates all of the above issues; poor-quality views will naturally have less useful information for the sensing task.

Sensing System Tasks

Given the NBV formulation of the problem at hand, it is now possible to list the tasks necessary for the reconfigurable sensing system to perform. They are as follows:

Detection: All objects in the environment must be detected upon entering and exiting the workspace. This includes not just the OoI, but all obstacles which are to be explicitly modeled and addressed during sensor-pose selection. As mentioned above, these objects may be moving or maneuvering on a priori unknown paths, and may be highly intrusive, presenting numerous partial or complete occlusions.

Tracking: All objects in the environment must be continuously tracked while they are within the workspace. The paths of these objects must be recorded to be used later in path prediction and sensor-pose decisions. The system should seek to minimize uncertainty in the pose estimates of all objects, but preference should be given to the OoI.

Prediction: The system must be able to predict future poses of all objects in the system, with minimum uncertainty. These predictions will also be used to decide future poses for the system. It must also be able to predict future forms of the object whenever possible, given any known or past information about the object and its current action. This information, too, can be used for pose decisions.

Pose Selection: Given all past and current information about the environment, OoI, obstacles, the current object action, and the system state, the system must select new sensor poses at each instant of operation which maximize the amount of previously unknown, unique information being sensed. In doing so, the system should seek to achieve globally optimal performance. In this context, globally optimal performance refers to the best possible sensing-task performance in the case of perfect a priori knowledge (effectively impossible in most real-world scenarios). This process must operate under strict time deadlines, while evaluating as many possibilities as possible. Without complete a priori knowledge of all paths and object actions, it will typically be impossible to select globally optimal poses in all cases [129]. Thus, the revised goal of the system is to achieve performance as close to the global optimum as possible. The system will seek to select optimal poses for a reduced problem: poses which are globally optimal for the case where the system knows only past, current, and predicted information about the environment, is already in a specified state determined by previous reconfigurations, and has a finite amount of time and reconfiguration ability. It is desirable that the global optimum of this reduced problem be near the global optimum found for a case where the system has complete a priori knowledge of all environmental motion. However, in reality, without perfect prediction or complete a priori knowledge, there can always be cases where solutions to the reduced problem will not be equivalent to the true, globally performance-optimal solution. Also, to allow for real-time operation, this constraint must be relaxed further; there may not be sufficient time to evaluate all necessary configurations to find the global optimum for the reduced problem. In these cases, this work accepts that the system may select near-optimal poses, although the system should seek to be as close to the optimum as possible. While this may seem like a significant reduction of scope, this is the reality for all real-world, real-time sensing systems. Complete a priori knowledge is rarely available for any significantly complex environment, and real-time operation means the system never has infinite time and resources to ensure a global optimum is found. However, these goals as stated will yield improved performance over no reconfiguration at all, and are designed to yield desirable average-case performance. It is up to the system designer to determine if the gain/cost tradeoff is appropriate for their particular application.

Reconfiguration: A pose is feasible when it is within the physical limits of the sensors, and is achievable when it can be reached before a hard deadline imposed as part of the sensor-pose decision. Thus, given a completed pose decision that is both feasible and achievable, the system must physically perform the reconfiguration requested. This is the final sub-part of the reconfiguration task.

2.4 Base Problem Model

Given the above tasks, it is now possible to formulate the problem at hand analytically. In doing so, this work will be formulating a problem which an automated system can explicitly solve to achieve the reconfiguration outlined above.

Performance Optimization

As identified in Section (2.3), the goal of a reconfigurable sensing system is to optimize performance of the sensing task, which in this case is TVG object-action recognition. Thus, the generic form of the core sensing problem can be expressed mathematically as a generic maximization of performance, Pr:

argmax_{s} Pr(s)    (2.7)

In Equation (2.7), s is the vector of sensor parameters, and Pr(s) is the sensing-task performance as a function of the sensor parameters. Inherently, performance is also a function of many other variables, such as environment contents (obstacle poses, etc.) and system state. However, for the purpose of formulating a machine-solvable optimization, the only controllable variables are the sensor parameters. Thus, all equations in this section will omit additional uncontrollable variables, unless explicitly required. Given this base formulation, one can transform the optimization into one which an automated sensing system can solve. First, the vector of sensor parameters can be simplified to just those parameters that the system will control on-line, namely the sensor poses:

s = [p^1, p^2, ..., p^{n_s}]    (2.8)

argmax_{p^1, ..., p^{n_s}} Pr(p^1, ..., p^{n_s})    (2.9)

In Equation (2.8), p^i is the 6-dof pose for sensor i, i = 1 to n_s. For a system with n_s sensors, performance becomes purely a function of the controllable variables, the poses of all sensors in the system. The sensor poses are functions of time, and the optimization should select the complete functions, p^i(t), which maximize performance. However, sensor-pose decisions must be selected on a per-instant basis. At any instantaneous point in time, past sensor-pose decisions are fixed (they have already been decided, and sensor movement has occurred). Future decisions can be planned based on predicted data, but must be adjusted to match future data as well. As such, the decision can be collapsed to only the current instant. Equations will omit the time notation for simplicity, unless explicitly needed. At this point, the performance function must be defined to continue further. However, past work has shown that expressing the performance of a sensing task purely as a function of sensor poses is an intractable task [77]. In many cases, a closed-form notation simply does not exist, except for comparatively simple sensing tasks [130]. To address this issue, this work defines a secondary metric which can be optimized in place of performance.

Performance as a Function of Visibility

As identified in the literature review, the majority of sensing tasks, including tracking, object recognition, and action recognition, all depend on clear, un-occluded views of the OoI. This is the type of sensor data that provides the most unique, previously unknown data about the object being sensed [22]. An intuitive measure of the quality of a viewpoint is the visibility of the object. Humans often refer to visibility as a measure of how much of an object is visible. This work defines a visibility metric which extends this concept: the visibility metric, as defined here, is a measure of the quality of a viewpoint for a given sensing task:

V^i = f_V^i(p^i)    (2.10)

In Equation (2.10), the visibility metric V^i for sensor i is expressed as a function, f_V^i, unique to sensor i, of this sensor's pose p^i at the current instant. This unique function, f_V^i, can be expressed in closed-form notation, as it is part of system design. Thus, its form is specified, rather than determined. One constraint that must be imposed, however, is that when choosing the visibility metric, a monotonically increasing function must exist which relates visibility to performance for the sensing task. One may note that this monotonically increasing function does not need to have a closed form; it need only exist:

Pr = f_Pr(V^1, V^2, ..., V^{n_s})    (2.11)

In Equation (2.11), the function f_Pr is the monotonically increasing function which relates performance to visibility. In this manner, increasing the visibility of a sensor should have only one of two effects on performance: (i) performance will remain the same, or (ii) performance will increase. The visibility metric must be specified such that increasing visibility will never decrease performance. This allows one to replace the performance optimization with a visibility-metric optimization, which is purely a function of controllable variables, and has a closed form:

argmin_{p^1, ..., p^{n_s}} G(V^1, V^2, ..., V^{n_s})    (2.12)

In Equation (2.12), the optimization has been re-written as an optimization of visibility metrics, where G is a combination function. The purpose of this function is to control the relative importance of individual sensor visibilities in the overall optimization. If the visibility metric chosen is independent between sensors, then the optimization can be performed on a per-sensor basis. This is only the case when changing one sensor's visibility does not affect any other sensor's visibility. Even in these cases, it may still be desirable to artificially steer the system to prefer certain sensors.

Final Optimization

Given the new optimization in Equation (2.12), it is now possible to formulate the per-instant optimization process. This optimization is performed by the system at any instant where the system determines a change in sensor poses must be made. These instants will be termed herein as demand instants. Any variables specific to the current demand instant are always labeled with a subscript 0. The per-instant optimization is shown in Table 2.3:

TABLE 2.3 CODE LISTING FOR PER-INSTANT VISIBILITY OPTIMIZATION

For each demand instant, t_j, j = 1 to m, perform the following:
[1] Given p^i(t_0 .. t_j), i = 1 to n_s, x_OoI(t_0 .. t_j), and x_obs,k(t_0 .. t_j); k = 1 to n_obs
[2] Perform argmin over p^1(t_l), ..., p^{n_s}(t_l) of G_m(V^1_l, ..., V^{n_s}_l); l = 1 to j
[3] Continue loop only while t_comp < t_max.

In the above optimization, the system considers a horizon of up to m + 1 demand instants, ranging from the current instant, t_0, to the instant t_j. The optimization begins with given information: p^i(t_0 .. t_j) is the pose of sensor i at demand instants 0 to j, x_OoI(t_0 .. t_j) is the pose of the OoI at demand instants 0 to j, and x_obs,k(t_0 .. t_j) is the pose of the k-th modeled obstacle at demand instants 0 to j, with a total of n_obs modeled obstacles. The visibility of sensor i at the j-th demand instant is given as V^i_j. The visibility optimization is the same as above, except that the function G_m is a new objective function which combines the visibility of all sensors over multiple demand instants. This function will be discussed in the following section; it allows the system to successively examine predicted future instants in an attempt to improve future, expected visibility along the demand-instant horizon. Generally, future instants will be weighted to account for their increased uncertainty, owing to pose prediction. The optimization successively examines a horizon which extends farther into the future, up to a maximum depth of demand instant t_m. This process is interrupted if the time spent processing the optimization, t_comp, exceeds a hard deadline set by the system, t_max. This ensures that real-time constraints are met, while examining as many options as possible. This optimization is not yet complete, however, as the many constraints imposed by the problem and the sensing system must now be included.
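To make the structure of Table 2.3 concrete, a highly simplified sketch of the rolling-horizon, deadline-bounded loop is given below. The candidate-pose generator, visibility function, and per-instant weights are illustrative placeholders, not the reference implementation; in practice the exhaustive candidate search would be replaced by the system's own search strategy, with the deadline checked inside the search itself.

import time, itertools

def plan_poses(horizon_m, deadline_s, candidate_poses, visibility, weights):
    """Rolling-horizon pose selection (sketch of Table 2.3).

    candidate_poses(l) -> iterable of joint sensor-pose tuples for instant t_l
    visibility(poses, l) -> per-sensor visibility metric values at instant t_l
    weights[l] -> discount applied to instant t_l (future instants less certain)
    """
    start = time.monotonic()
    best = None
    for j in range(1, horizon_m + 1):              # extend the horizon: t_1, t_2, ...
        def objective(plan):
            # G_m: weighted sum of per-instant, per-sensor metrics
            return sum(weights[l] * sum(visibility(plan[l - 1], l))
                       for l in range(1, j + 1))
        plans = itertools.product(*(candidate_poses(l) for l in range(1, j + 1)))
        best = min(plans, key=objective)           # argmin, as in Eq. (2.12)
        if time.monotonic() - start > deadline_s:  # hard real-time deadline t_max
            break
    return best[0] if best else None               # poses for the next instant, t_1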

2.5 Problem Constraints

Chapter 1 identified several potential constraints on the performance optimization. All solutions to the sensing-system reconfiguration problem must respect these constraints, and it is simplest to enforce them at the lowest level, as part of the core optimization problem.

Physical Constraints

The potential solution space is first constrained by the physical capabilities of the sensing system. It is assumed that an off-line sensing-system reconfiguration algorithm has been used to determine a suitable selection of sensors and their capabilities, as well as their initial locations. A real-world dynamic sensor would have limits on its movements. The most basic constraints are on absolute 6-dof pose:

p^i_min ≤ p^i ≤ p^i_max    (2.13)

In Equation (2.13), the pose p^i for the i-th sensor must always lie between the lower and upper physical limits, p^i_min and p^i_max, respectively. This work does not consider cases where sensor dof may be coupled to other sensors. For the case of a static sensor, p^i_min = p^i_max. It is assumed that there is no significant overlap in sensor capabilities; the system does not directly address sensor collisions or distribution of highly similar sensors. For articulated-object sensing, the use of an off-line reconfiguration method will rarely generate a sensor set with significantly redundant capabilities, as it is an inefficient use of resources [62], [131]. Deformable objects and other application-specific cases (such as an existing sensor system) may still use sensor systems with significant overlap, so the proposed method is designed to be extensible to such cases, although they are not directly addressed in this work. The set of all possible sensor poses which satisfies Equation (2.13) is called the feasible pose set, and is referred to as P^i_feas for the i-th sensor in future equations. Aside from its absolute pose, the motion capabilities of the sensor could also be limited:

|a^i| = a^i_max    (2.14)

|v^i| ≤ v^i_max    (2.15)

From Equations (2.14) and (2.15), this work assumes a constant-acceleration, maximum-velocity model for sensor motion. Acceleration for the i-th sensor, a^i, is fixed to its maximum magnitude, a^i_max, and sensor velocity, v^i, is limited in magnitude to v^i_max. These constraints do not directly affect the set of feasible poses, but when combined with the current sensor state (position, velocity, and acceleration), they will be used to form additional constraints.

Time Constraints

At any time instant being considered for a pose decision, the sensor will be in a certain state of motion defined by its current pose, velocity, and acceleration. It is assumed that acceleration can be changed instantaneously to any desirable value. The pose decision is selected for the immediate future instant, so the desired sensor poses must be reached at or before this instant. Given the current sensor state, it is possible to define a range of achievable poses which a particular sensor can reach before the next instant. The set of these poses will be referred to as P^i_ach, and it is defined by poses which satisfy:

p^i_ach_min ≤ p^i ≤ p^i_ach_max    (2.16)

In Equation (2.16), p^i_ach_min and p^i_ach_max are the lower and upper achievable limits for sensor i, given the current state of the sensor. These limits can be determined from the current sensor state of a single sensor using Equations (2.17) to (2.25), stated for a single sensor dof, q:

Δt = t_1 - t_0    (2.17)

t_up = (v_max - v_0) / a_max    (2.18)

t_down = (v_max + v_0) / a_max    (2.19)

d_up = v_0 min(t_up, Δt) + (1/2) a_max min(t_up, Δt)^2 + v_max max(Δt - t_up, 0)    (2.20)

d_down = -v_0 min(t_down, Δt) + (1/2) a_max min(t_down, Δt)^2 + v_max max(Δt - t_down, 0)    (2.21)

q_up = q_0 + d_up    (2.22)

q_low = q_0 - d_down    (2.23)

q_ach_max = min(q_up, q_max)    (2.24)

q_ach_min = max(q_low, q_min)    (2.25)

From Equation (2.17), for sensor i (subscripts are omitted for clarity), the time remaining for motion is defined as Δt, the difference between the current instant's time, t_0, and the next instant, t_1. Equations (2.18) and (2.19) calculate the time it would take to achieve maximum velocity in either direction as t_up and t_down, given a current velocity of v_0, and the maximum velocity and acceleration from before. From these times, the maximum distances from the current position reachable in the upper and lower directions, d_up and d_down respectively, are determined using Equations (2.20) and (2.21). Adding these values to the current dof value, q_0, in Equations (2.22) and (2.23) yields the un-truncated upper and lower achievable limits, q_up and q_low, respectively. Finally, Equations (2.24) and (2.25) truncate these limits to fall within the feasible pose space, defined by the upper and lower limits q_max and q_min of Equation (2.13). The resultant achievable lower and upper limits, q_ach_min and q_ach_max, are used to form the complete vectors p^i_ach_min and p^i_ach_max in Equation (2.16). It can be noted that the set of achievable poses, P^i_ach, is thus a strict subset of the feasible poses of the system:

P^i_ach ⊆ P^i_feas    (2.26)

These two constraints, on feasible and achievable sensor motion, form the basic physical constraints imposed on all sensor-pose decisions by the system itself.
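A minimal numerical sketch of the per-dof achievable-limit computation in Equations (2.17) to (2.25) is given below, under the stated constant-acceleration, maximum-velocity model; the function and variable names are illustrative.

def achievable_limits(q0, v0, v_max, a_max, q_min, q_max, dt):
    """Reachable [lower, upper] limits for one sensor dof within time dt.

    Constant-acceleration, maximum-velocity model (Eqs. 2.17-2.25):
    accelerate at a_max toward each direction, saturating at +/- v_max,
    then clamp the result to the feasible range [q_min, q_max].
    """
    def reach(v_signed):
        t_sat = (v_max - v_signed) / a_max        # time to reach saturation speed
        t_acc = min(t_sat, dt)
        d = v_signed * t_acc + 0.5 * a_max * t_acc ** 2   # accelerating phase
        d += v_max * max(dt - t_sat, 0.0)                 # cruise at v_max
        return d
    q_up = q0 + reach(v0)       # maximum displacement in the positive direction
    q_low = q0 - reach(-v0)     # maximum displacement in the negative direction
    return max(q_low, q_min), min(q_up, q_max)

# Example: a pan axis at 10 deg, moving at +5 deg/s, 0.5 s until the next instant
print(achievable_limits(q0=10.0, v0=5.0, v_max=20.0, a_max=40.0,
                        q_min=-90.0, q_max=90.0, dt=0.5))   # (7.5, 17.1875)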

Other Constraints

Aside from the constraints above, it is also important to constrain the tradeoff which occurs when the system optimizes visibility for future (predicted) instants. Under the optimization scheme in Table 2.3, the system may trade small gains in immediate visibility for large expected gains in near-future visibility, depending on the combination function. To ensure that a minimum level of visibility is maintained for all instants, a constraint is imposed on the resultant visibility of the optimization:

V^i ≥ V_min, i = 1 to n_s    (2.27)

In Equation (2.27), V_min refers to a minimum level of visibility which must be maintained for every sensor in the final pose decision. Some visibility of the object is necessary to ensure that (i) tracking of the OoI can be maintained, (ii) predictions about the next instant can be reconciled with actual data, and (iii) continuous data is available to the action-recognition process. The final value of V_min is largely application-specific, and is determined during implementation.

Complete Optimization

By combining all the above constraints, the final core optimization method for the reconfigurable sensing system can be re-written, Table 2.4:

TABLE 2.4 CODE LISTING FOR COMPLETED CORE OPTIMIZATION TASK

For each demand instant, t_j, j = 1 to m, perform the following:
[1] Given p^i(t_0 .. t_j), i = 1 to n_s, x_OoI(t_0 .. t_j), and x_obs,k(t_0 .. t_j); k = 1 to n_obs
[2] Perform argmin over p^1(t_l), ..., p^{n_s}(t_l) of G_m(V^1_l, ..., V^{n_s}_l); l = 1 to j
[3] Subject to p^i(t_l) in P^i_feas, p^i(t_l) in P^i_ach, and V^i_l ≥ V_min.
[4] Continue loop only while t_comp < t_max.

The above optimization implements all of the above tasks and constraints. Moreover, it will allow the system to examine multiple future instants along the time horizon to maximize visibility over time. The nature of the loop in the above code listing makes the system first consider poses only for the immediate future instant, t_1. If time remains, it will progressively include more instants in this optimization: t_1 and t_2; then t_1, t_2, and t_3; and so on. It is expected that these future instants are weighted against their inherent uncertainty using the combination function, G_m. As mentioned above, the goal is to allow the system to trade near-future visibility for larger expected gains later. In this manner, the overall optimization is not a greedy algorithm, even though the baseline per-instant optimization is greedy (as a result of the monotonicity requirement on the visibility metric). By considering multiple instants along the rolling horizon, the optimization at each instant is pre-constrained to account for any desired trade-off of current visibility for future visibility. As such, it is possible to avoid any local maximum which would otherwise be selected by a greedy algorithm. There is no reason for the system to select non-optimal poses for the per-instant problem, as any trade-off is already captured by the constraints on the problem. This division better models the linked nature of sensor-pose decisions, and is simpler to implement in practice. Naturally, its performance is tied to the quality of the available predictions of future object behavior.

Many situations arise in typical applications where this behavior is useful. The sensor system does not have perfect a priori knowledge of all object paths; as such, it does not know if it is making a series of pose decisions which, although optimal in the short term, may significantly worsen the system's choices later. This is a common occurrence with obstacles and a limited sensor-movement window. Let us consider an example where a subject is about to walk behind an obstacle. The sensor may gradually move in one direction, towards its feasible limit, to keep an un-occluded view. Eventually, it will reach its travel limit, and the subject will pass behind the obstacle. The sensor must now accept several instants of complete occlusion while it moves over a significant portion of its travel range to re-acquire the subject on the other side of the obstacle. If, instead, the system accepts partially occluded views earlier in this process, it can move close to the obstacle, in preparation to quickly pass it and re-acquire the object on the other side. These views are obviously lower in quality than completely un-occluded views, but avoiding several later instants of complete occlusion may be worth the trade-off. More importantly, continuous surveillance of the object is achieved; as defined above, the general-case TVG action-recognition process strongly prefers continuous data over intermittent data, even if the latter views are individually more complete. The final piece of the sensing-system reconfiguration problem is the visibility metric itself. It is a principal part of the optimization, and must be carefully designed to satisfy the monotonically increasing relation to performance. The following section will outline a baseline visibility metric that can be customized to suit most TVG object-action recognition applications.

2.6 Visibility Metric

As mentioned above, the choice of a visibility metric is a key part of the reconfiguration problem outlined in this work. This section will present the metric developed over the course of multiple, carefully controlled simulations and experiments. The results of these experiments will be discussed in later sections, including how they were used to form the metrics detailed here. For now, a brief description of their function will be provided.

Past Visibility Metrics

Some earlier systems designed for fixed-geometry objects also incorporated the concept of a visibility metric for their performance optimization (e.g., [61], [90]). These metrics are often simpler than what is necessary for TVG action recognition. However, they can still provide a starting point for the design of a TVG metric. These methods utilized the fact that an object would not change significantly in appearance, and additionally constrained the problem to objects with uniform appearance. In [61], objects are modeled uniformly; any view of an object would be acceptable, so long as it is un-occluded. Hence, the visibility metric focuses solely on occlusions by obstacles in the environment, a key factor. The metric itself is binary, occluded or not occluded, 0 or 1, although later works (e.g., [78]) extended the metric to a spectrum of values representing partial occlusions. These metrics are useful for sensing pose-invariant features, such as object position. Others further extended these metrics to include simple differentiated-view concepts: some views offer more useful information than others. For example, in [90], the visibility metric is designed to prefer principal views of a human face which are most beneficial to face-based recognition. Other works, such as [22], incorporate the next-best-view problem for fixed-geometry objects. In [23], the goal is to construct a 3-D depth map of a fixed object. The system seeks to recover the most unknown information in the depth map with each iteration. Many similar systems exist (e.g., [124]), as this problem has become a field of its own. These works rarely address obstacles, but even for un-occluded views there is differentiation identified. In particular, many of these works find that views which center the feature of interest, and which offer as much detail of the feature as possible, produce the best results. These findings are incorporated into the visibility metric developed in this dissertation. Later feasibility studies modify and significantly extend these basic findings to address issues specific to sensing TVG objects and their actions.

Visibility Metric Properties

To define the visibility metric, it is important to identify a few basic properties. The visibility metric is assumed to be normalized; it ranges from 0 to 1 in value. Visibility is defined on a per-sensor basis as a combination of weighted sub-metrics, and is application dependent. A sub-metric is simply a component part of the overall visibility metric. The following sub-metrics were determined through feasibility studies which characterized the type of views most preferred by a number of commonly used vision methods. From these results, three sub-metrics were selected that represented factors common to all algorithms tested. As such, the following sub-metrics can be considered to be general-purpose sub-metrics. For a given task and implementation, it is expected that one would add additional metrics which further capture characteristics of the sensing task or implementation that are not completely represented by just these three metrics. They should be viewed as a starting point, rather than an exhaustive set. As such, the basic visibility metric is defined as a weighted sum of sub-metrics:

V^i = Σ_{k=1..n_m} w_k m_k^i    (2.28)

Σ_{k=1..n_m} w_k = 1    (2.29)

In Equation (2.28), the visibility metric V^i for sensor i at any instant is given by the sum of n_m sub-metrics, m_k^i, individually weighted by w_k. The selection of weights is entirely application dependent, and will depend on the choice of sub-metrics and on what system behavior is most desirable for the sensing task. The sum of all weights must be 1 to ensure the final visibility is between 0 and 1. Sub-metrics are assumed to be normalized to values between 0 and 1 as well. Common sub-metrics are discussed in the following sections.
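A minimal sketch of the weighted combination in Equations (2.28) and (2.29) is given below; the individual sub-metric functions are placeholders for those defined in the following subsections.

def visibility(pose, sub_metrics, weights):
    """Per-sensor visibility metric (Eq. 2.28): weighted sum of normalized sub-metrics.

    sub_metrics: list of callables, each returning a value in [0, 1] for a pose
    weights:     list of non-negative weights summing to 1 (Eq. 2.29)
    """
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to one"
    return sum(w * m(pose) for w, m in zip(weights, sub_metrics))

# Example with two placeholder sub-metrics (e.g., distance and angle)
V = visibility(pose={"x": 1.0, "yaw": 0.2},
               sub_metrics=[lambda p: 0.4, lambda p: 0.1],
               weights=[0.7, 0.3])
print(V)   # 0.31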

Metric: Distance

A distance metric seeks to select poses for which the camera focus is physically close to the object being sensed. This has multiple effects on the resultant views. First, views tend to be centered on the object if no other metrics are used, but this effect is strongly outweighed in most real-world cases. More importantly, this metric prefers large, close views of the object being sensed. The uncertainty in parameter estimation, such as point triangulation, tends to increase significantly with distance from the camera. Preferring poses close to the object tends to offset this effect. This metric is given by:

m_d^i = ‖x_f^i - x_OoI‖ / d_max^i    (2.30)

d_max^i = max over p^i in P^i_ach of ‖x_f^i(p^i) - x_OoI‖    (2.31)

The sub-metric of distance, m_d^i, is the distance for a given pose from the sensor's focal point, x_f^i, to the object's center of mass, x_OoI. This is normalized by the maximum possible distance, d_max^i, given the sets of feasible and achievable poses for the current instant. The sensor's focal point is a pure function of pose, x_f^i(p^i), determined by the sensor model.

Metric: Angle

An angle metric seeks to select poses with the OoI centered in their view. Having objects near the center of view provides a number of benefits. First, it maximizes the average amount of time an object must be moving before it moves out of a sensor's field of view. Many sensors (e.g., cameras) exhibit distortion or error which is greater near the edges of their field of view. It also increases robustness to error in pose estimation. To create this behavior, the following metric is used:

m_a^i = θ^i / θ_max^i    (2.32)

θ^i = cos^-1( (x_f^i - x_c^i) · (x_OoI - x_c^i) / ( ‖x_f^i - x_c^i‖ ‖x_OoI - x_c^i‖ ) )    (2.33)

θ_max^i = max over p^i in P^i_ach of θ^i(p^i)    (2.34)

In Equations (2.32) and (2.33) above, the angle metric, m_a^i, is defined by the angle θ^i from the sensor focal line to the line which intersects the sensor center point, x_c^i, and the object center, x_OoI. The sensor focal line intersects the sensor center point and the sensor focal point, x_f^i. Essentially, it is the direction vector of this sensor. This is normalized by the maximum possible angle for the current instant, θ_max^i, which is calculated using the maximization in Equation (2.34). As in Equation (2.31), the exact optimization objective will depend on the sensor model.

Metric: Visible Area

The final common sub-metric measures the visible surface area of the object. This metric seeks to maximize the amount of surface area of the object visible to the given sensor. This improves performance for many feature-based algorithms, such as pixel-based tracking methods and object-identification methods. To account for the quality of the resultant sensor data, the effect of foreshortening is included in the metric. The surface area is projected through the sensor; this metric measures the amount of surface area visible after quantization in the sensor data. Therefore, a view containing the whole object as a single pixel in a large camera image, for example, would not be artificially preferred over more useful views. The metric is as follows:

m_A^i = 1 - ( Σ_{k=1..n_A} A_k ) / A_max    (2.35)

Above, the visible portions of the object are separated into n_A distinct areas, with projected surface areas A_k. Typically, in a camera-based system, these will be image pixels (typically determined as a directly measured value from in-system or simulated images). Again, the summation of all areas is normalized by the maximum possible area, A_max. The choice of A_max is application dependent, but there are two general choices: (i) take A_max as the sensor resolution, or (ii) take A_max as the maximum possible area given the same projection conditions, but no obstacles. There is a trade-off inherent in this choice. Option (i) is simple to calculate, but tends to scale poorly; an object typically has a low dynamic range of projected area for a given sensor system, so this option tends to strongly overestimate the maximum possible visible area in most cases. Option (ii) is more accurate, but must be calculated on a per-instant basis, and requires accurate knowledge of obstacles/occlusions, detailed models of all objects (including obstacles), and significantly more calculation. The system designer can choose the best method for their application.
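As an illustration of Equation (2.35), the sketch below computes the visible-area sub-metric from a segmentation mask of the OoI, using Option (i), the sensor resolution, as A_max; the mask source is assumed to exist elsewhere in the implementation.

import numpy as np

def visible_area_metric(ooi_mask, a_max=None):
    """Visible-area sub-metric (Eq. 2.35).

    ooi_mask: boolean image mask, True where a pixel belongs to the visible OoI
    a_max:    maximum possible projected area; defaults to Option (i),
              the full sensor resolution in pixels
    """
    visible = int(np.count_nonzero(ooi_mask))
    if a_max is None:
        a_max = ooi_mask.size          # Option (i): total pixel count
    return 1.0 - visible / float(a_max)

# Example: a 480x640 image in which the OoI covers a 100x80-pixel region
mask = np.zeros((480, 640), dtype=bool)
mask[100:200, 200:280] = True
print(visible_area_metric(mask))       # 1 - 8000/307200, approximately 0.974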

Model-Based, Multi-Sub-Part Metric

The above metrics inherently assume that the object is uniform and that every part is of equal interest. However, as identified in the discussion of the NBV problem, given a certain action, some sub-parts of the object provide more useful information for distinguishing the action than other parts. For example, the legs are most relevant in distinguishing a human walking motion. When sensing TVG object actions, viewpoints have non-uniform importance due to the nature of actions. This effect is increased when sensing an action over time, as the system may have detailed information on some sub-parts, but little information on others. Those with little known information also become relatively more important, increasing viewpoint differentiation. An articulated, model-based method for action recognition has also been chosen. All of these factors suggest that a uniform object model of visibility is not optimal. As part of Section (2.2), it was identified that for an articulated object, certain sub-parts of the object are relatively action-invariant. It would be useful to model these sub-parts as individual objects, each with a separate visibility metric. The resulting metrics can be fused with weights to control the relative importance of each sub-part:

V^i_j = Σ_{k=1..n_p} w_{k,j} V^{i,k}_j    (2.36)

Σ_{k=1..n_p} w_{k,j} = 1    (2.37)

From Equation (2.36), for an object with n_p sub-parts, the visibility of the object to sensor i at the j-th instant, V^i_j, is the weighted sum of the visibility metrics of each sub-part, with the sub-part metric given by V^{i,k}_j for sensor i, sub-part k, and instant j. This sub-part metric is calculated exactly as in the previous sections for a single, rigid sub-part of the overall object. The weights, w_{k,j}, are controlled by the system to be proportional to the relative importance of each sub-part at a given instant. This process will be examined in more detail in Chapter 3. To maintain the previous definition of visibility, Equation (2.37) states that all weights must sum to one. This is the final definition of visibility, which incorporates all the qualitative sensing tasks into a single mathematical formulation.

2.7 Summary

This chapter introduced the core optimization problem, which is solved in a reconfigurable sensing system to achieve on-line pose selection for a TVG object-recognition task. The critical sensing tasks for TVG object-action recognition were outlined as detection, tracking, estimation, and action recognition. Moreover, the action-recognition task was formulated mathematically as a comparison of feature vectors which are aggregated over time. Characteristics of this process were identified which, if directly addressed by a sensing system, could improve action-recognition performance. Thus, a reconfigurable sensing system, specifically designed to improve performance in these tasks, will be proposed in Chapter 3. The tasks for this system (detection, tracking, prediction, pose selection, and reconfiguration) were qualitatively examined in this chapter. Attributes specific to TVG objects, such as non-uniform importance of viewpoints, self-occlusion, and continuous sensing, were identified as issues which a reconfigurable sensing system could address. After formally outlining all requirements of such a system, a detailed mathematical formulation was developed. The key concept of this formulation is the use of an alternative visibility metric to form an objective function for the system to optimize. Visibility is defined as a pure, closed-form function of sensor poses, while maintaining the existence of a monotonically increasing performance function. This allows the system to directly implement the optimization in terms of sensor-pose selection. Finally, the visibility metric itself was developed as a summation of several sub-metrics. These application-specific metrics can be added through a weighting process to create a variety of sensor pose-selection behaviors which best suit the task and sensing system. The visibility metric was also formulated into an articulated metric which allows the system to assign weights to sub-parts of the object based on their relative importance. In this manner, the issues identified in the qualitative examination of TVG object-action recognition were formulated into a mathematical problem. This problem will be examined further in Chapter 3, wherein a generalized framework which inherently solves the optimization will be proposed.

3. Customizable TVG Action Sensing Framework

In Chapter 2, a detailed formulation of the time-varying geometry (TVG) action-sensing problem was outlined. Using this formulation, the reconfigurable sensing-system task was posed as an optimization problem which an automated system can directly solve. The resultant optimization problem must be implemented by a sensing system, and the optimization itself is necessarily customizable to a variety of TVG object-sensing applications. Key qualities of the sensing system, such as continuous sensing, real-time operation, and robustness, were identified as necessary traits for any system implementing real-world TVG object-action sensing. While the problem formulation inherently addresses continuous sensing, is designed to allow real-time operation, and enhances robustness, the latter two issues are left largely to the system designer. As such, it is necessary, as part of a complete solution proposal, to provide a formal, customizable framework which gives a system designer the tools necessary to tailor the theory presented in this dissertation to their specific application. This chapter will present the proposed real-time active-vision framework for TVG object-action recognition. It first presents theory for the chosen pipeline architecture (Section 3.1). Using basic pipeline theory, the customizable framework is developed (Section 3.2), including detailed explanations of all stages in the resultant pipeline (Section 3.3). This section (Section 3.3) also demonstrates in detail how the pipeline operates in real-time.

3.1 Pipeline Background

Given the requirement for real-time operation determined in Chapter 2, it is useful to select a formal, well-known real-time architecture as the basis for the proposed customizable framework. This dissertation proposes the use of a pipeline architecture [132] as a framework for sensing-system reconfiguration. The pipeline structure is analogous to an assembly line; rather than sequentially processing one element from start to finish before starting the next, processing can be subdivided and tasks performed in parallel. The total time it takes for one element to traverse the pipeline will always be equal to or greater than (due to overhead) that of a non-pipelined system [133]. The benefit lies in increasing the rate at which elements exit the pipeline: the average sensor-pose update rate, in the case of the sensing problem at hand.

The basic concept of parallel computing is speedup; the speedup, S, measured for a task is expressed as a factor of the original, sequential time to complete the task [134]. For any given task, some parts of the problem will be parallelizable, and others will not. The portion that can be performed in parallel, P, determines the upper limit on the speedup a parallel architecture can achieve. Two common upper limits are defined by Amdahl's Law [134] and Gustafson's Law [135], Equations (3.1) and (3.2):

S_max = 1 / ( (1 - P) + P/N )    (3.1)

S_max = (1 - P) + N P    (3.2)

In Equations (3.1) and (3.2) above, N is the number of parallel processing elements used. The first equation, for Amdahl's Law, assumes a fixed problem size, and that the sequential portion of the problem (i.e., the non-parallelizable portion) is independent of the number of processing elements. The second equation, for Gustafson's Law, does not. For example, with P = 0.9 and N = 4, Amdahl's Law limits the speedup to approximately 3.1, while Gustafson's Law yields 3.7. Furthermore, pipeline theory states that the best-case speedup achieved by a pipeline is equal to the number of pipeline stages (neglecting stage overhead, [133]). Thus, the total maximum speedup is limited by both the architecture limit and an upper limit given by the task itself. In practice, the system will likely achieve a speedup below the maximum limit due to real-world issues, such as stage overhead [133]. Detailed experiments in Chapters 5 and 6 will examine the speedup in a real-world implementation.

Parallel operation in a pipeline also introduces the concept of hazards [136]. Structural hazards occur when two pipeline work units require the same functional hardware simultaneously. Data hazards can occur when shared or global data elements must be accessed by multiple stages in the pipeline simultaneously. Control hazards are associated with conditionality and branching, which is not present in the problem at hand. All hazards identified in the proposed pipeline, plus the associated solutions (such as forwarding paths and parallel hardware), will be detailed in the following sections.

3.2 Pipeline Architecture Overview

Early research into sensing-system reconfiguration yielded quasi-real-time systems, such as [64] and [76], the details of which are presented in Chapter 4. These systems were designed with real-time operation in mind, but several factors made this difficult to realize with the hardware available for many applications. In particular, the method was designed around a central-planning architecture, wherein multiple agents communicate with a central planning agent to select sensor poses. It was identified in [137] that this structure contains a single, long, multi-step critical path which can be exploited through clever system design to improve the average update rate of the system. Furthermore, several sub-tasks were found to have significant parallelism, as will be explained further in Section 3.2. The result of a detailed re-design process, with real-time operation in mind, is shown in Figure 3.1.

FIGURE 3.1 OVERVIEW OF PROPOSED CUSTOMIZABLE PIPELINE ARCHITECTURE

Update Structure

Given this pipeline framework, it is first useful to examine the structures and data representations which will be passed between stages. The chosen pipeline type is a synchronous, buffered pipeline [133]. This form of pipeline uses a global system clock to control the transition of data between stages. While it is possible to use an asynchronous pipeline, there would be little benefit in this application. Even though there is not a perfect load balance between all stages, asynchronously buffering the input from faster stages would not improve performance, due to the sequential nature of a pose decision. Furthermore, the execution time of sub-tasks was found to be relatively constant for most stages, which is also best suited to a synchronous pipeline. As such, the pipeline uses single-buffering between stages, unless specified otherwise.
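A minimal sketch of the synchronous, single-buffered update scheme described above is given below: on each global clock tick, every stage consumes the work unit left in its input buffer and deposits its result for the next stage. The stage behavior shown is a placeholder, not part of the proposed pipeline.

def tick(stages, buffers):
    """Advance a synchronous, single-buffered pipeline by one global clock cycle.

    stages:  list of callables; stages[k] transforms the work unit for stage k
    buffers: list of length len(stages) + 1; buffers[0] is the pipeline input,
             buffers[-1] the output; each slot holds one work unit or None
    """
    # Process from the tail forward so each stage reads the value produced
    # on the previous cycle (single buffering, no overwrite hazards).
    for k in reversed(range(len(stages))):
        buffers[k + 1] = stages[k](buffers[k]) if buffers[k] is not None else None
        buffers[k] = None
    return buffers

# Example: a 3-stage toy pipeline fed with work units 1, 2, 3, 4
stages = [lambda u: u * 10, lambda u: u + 1, lambda u: f"pose update {u}"]
buffers = [None] * (len(stages) + 1)
for n in range(1, 5):
    buffers[0] = n
    tick(stages, buffers)
    print(buffers[-1])   # None, None, 'pose update 11', 'pose update 21'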

The system clock itself is externally controlled, and may be adjusted independently. The pipeline work unit is considered to be a pose update, which consists of the desired sensor poses for all sensors, selected by the system to be achieved before the next demand instant. The pose update travels from Stage L1 to Stage L10 in the pipeline. At the input of the pipeline are raw images, quantized directly from the environment by the sensors (cameras), and at the output of the pipeline are the low-level motion controller and motion stages, which implement the sensor-motion scheme decided upon by the system. The representation of this decision changes significantly over the course of the pipeline as data is manipulated. Input data is in the form of a matrix of quantized pixel values. These raw data are filtered to remove uninteresting pixels, and compression is used to pass a cropped, compressed image between Stages L1, L2, and L3. These cropped images are searched for features of interest, and the data representation is reduced to a vector of 2-D pixel locations for all points of interest, transmitted from Stage L3 to L4. This vector is processed into a vector of normalized camera-coordinate constraints, which are transmitted to Stage L5. This stage produces a vector of 3-D world-coordinate estimates for all visible model features and object locations, which is transmitted to all subsequent stages in the pipeline in sequence. Additionally, Stage L6 generates an estimate of the current subject feature vector, which is passed to Stages L7, L8, and L9. Stage L7 generates predictions of future subject feature vectors and world coordinates of all environment objects, which are transmitted to Stage L8 only. Stage L8 accumulates all information in the pipeline and produces potential sensor-pose decisions, which are transmitted through Stage L9 to Stage L10. Stage L9 produces the actual estimates of the current subject action. Finally, Stage L10 emits sensor-motion commands based on the potential pose decision received from Stage L8. Multiple forwarding paths exist as well. Due to the highly complex nature of these interactions, and the dynamic data representation, each of these transitions will be discussed in detail in Section 3.3, below.

Pipeline Depth and Superscalar Execution

The selection of ten pipeline stages is part of a tradeoff between pipeline depth (and inherent latency increases) and parallelism. It was identified through controlled simulation and experiments (see Chapters 4 and 5) that many of the tasks necessary in any TVG object-action surveillance system exhibit coarse-grained parallelism, or are embarrassingly parallel [133]. The former term refers to tasks with little or no interaction, and the latter term refers to tasks whose parallel structure is almost completely inherent; they can be readily separated and processed without interaction or sequence. As identified above, the linear nature of the sub-tasks in generating pose decisions naturally lent itself to a pipeline structure. Furthermore, natural divisions in functionality are evident in the process. However, these natural divisions do not necessarily balance the load in the pipeline. Some sub-tasks, such as image correction or pose selection, are inherently more computationally costly than other operations, such as pose prediction. But the majority of these computationally expensive sub-tasks happen to be highly parallelizable. Under the chosen 10-stage pipeline and the corresponding labor divisions, the system achieves (i) better functionality abstraction and encapsulation than the previous, highly coupled system, and (ii) improved load balancing over a centralized architecture. For any stages in the pipeline which would otherwise be overloaded, superscalar execution (through parallel pipes and hardware) is used to reduce execution time. As will be shown in Chapter 6, the sections of the pipeline with parallel pipes are found to have P close to 1, meaning they are indeed highly parallelizable.

3.3 Pipeline Stages

Given the overview of the complete pipeline presented in Section 3.2, it is now necessary to examine each pipeline stage in detail. This section will present a description of the functionality each pipeline stage must implement and may optionally implement. For all functionality, a review of potential implementations and details of the chosen reference methods will be included. This section will begin at the head of the pipeline, with Stage L1. It should be noted that any pipeline stage may be collapsed into another stage if the load balance allows. In general, this will not change the central optimization process, provided the current pipeline order is maintained. Changing the order may introduce new pipeline hazards other than those already addressed, necessitating additional forwarding paths or other measures.

Stage L1: Imaging Agent

The Imaging Agent is responsible for primary sensor-data collection and processing. This means the stage is pre-processing oriented. As identified in Sections 3.1 and 3.2, the benefits of using a pipeline architecture depend on appropriate load balancing and minimum inter-stage overhead. Raw images, as captured by cameras, are the least efficient data representation used in the sensing system. If these raw images were to be passed through multiple stages, they would create significant overhead and inter-stage storage requirements. It has been proven in past work that for most sensing tasks, even good-quality images contain significant extraneous or redundant information [138]. Raw pixel formats, such as those captured by most cameras, are also the least space-efficient way to represent an image. As such, it is beneficial to first reduce the amount of raw data which will be passed to Stage L2. All savings will be propagated throughout the pipeline, meaning the effort/benefit tradeoff is high: the system should expend maximum effort to remove non-useful information early in the pipeline [133]. Useful information must furthermore be transmitted using as compact a representation as possible. Since this stage will already be working with raw images by nature, it is also a natural point to implement any per-element operations on the raw image data. Both of these tasks will be discussed in detail, with a detailed list of options provided which the system designer can customize to their particular sensing task and sensor data.

The above discussion inherently assumes image-based sensors (cameras), although the type, parameters, capabilities, and other internals are deliberately left unspecified. It should be noted that the method can also be used with additional sensor types, such as infrared sensors, ultrasonic sensors, etc., provided the necessary data fusion can be performed, and that the visibility metric (Stages L6-L10) can be suitably specified to provide the best dataset from these sensors. Encapsulation of functionality has been specifically used, such that the sensors need only create a cloud of detected 3-D world-coordinate feature points (Stage L6). How they arrive at this cloud can be changed without significantly affecting the input side of later pipeline stages. However, to achieve the best results overall, one does need to consider the selection of poses which best suits these sensors; as mentioned above, one must select a suitable visibility metric.

Image Capture and Quantization Block

The beginning of all operations in the proposed framework is data acquisition. For a camera-based sensing system, this means the cameras themselves. As shown in Figure 3.2, the cameras are at the input of an internal pipeline for Stage L1, which has its output connected to Stage L2. This pipeline may be implemented in hardware or it may be virtualized. The number of parallel pipes should generally be equal to the number of physical sensors in the system. The use of parallel pipes was chosen to allow for heterogeneous sensors: data needs to be converted to a uniform representation before being transmitted to the subsequent stages. However, it is useful to treat per-element operations separately, before data combination, to allow for maximum flexibility and parallelism.

FIGURE 3.2 STAGE L1 INTERNAL PIPELINE AND SUB-BLOCKS

Image capture begins with the quantization block in each parallel pipe. Every physical sensor in the system is associated with one quantization block. For the purpose of this research, it will be assumed that the sensors used in the surveillance system are image-based sensors, typically cameras. Most common camera types can be used with little modification to the method that follows, including narrow-focus and panoramic cameras. It should be noted that in these cases, off-line reconfiguration must still be performed to select the number, types, and internal parameters of the cameras comprising the system. As such, some camera types (which benefit little from the active aspect of the system, such as panoramic cameras) would be less likely to be recommended for use, although their use is certainly not precluded. The basic camera projection is shown in Figure 3.3, showing the projection of an image onto a plane of atomic elements, or pixels.
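To complement the projection illustrated in Figure 3.3 below, a minimal pinhole-projection sketch is given here; the intrinsic parameters are illustrative only and are not those of any system camera.

def project_point(point_camera, f_px, cx, cy):
    """Project a 3-D point in camera coordinates onto integer pixel coordinates.

    Idealized pinhole model: u = f * X / Z + cx, v = f * Y / Z + cy,
    followed by quantization to the nearest pixel.
    """
    X, Y, Z = point_camera
    u = f_px * X / Z + cx
    v = f_px * Y / Z + cy
    return int(round(u)), int(round(v))

# Example: a feature point 2 m in front of the camera, slightly off-axis,
# with an assumed 800-pixel focal length and a 640x480 image center
print(project_point((0.10, -0.05, 2.0), f_px=800.0, cx=320.0, cy=240.0))
# (360, 220)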

FIGURE 3.3 PROJECTION OF FEATURE POINT INTO PIXEL COORDINATES

All 3-D world geometry visible in a given camera view is projected onto the 2-D projection plane, given in pixel coordinates. A single pixel is assumed to be a vector of color components, such as the RGB model for pixel (x, y), as given by Equation (3.3):

c_{x,y} = [r_{x,y}, g_{x,y}, b_{x,y}]    (3.3)

Under this representation, r_{x,y}, g_{x,y}, and b_{x,y} are the quantized red, green, and blue color values for pixel (x, y), respectively. Quantization inherently assumes that real-world color intensity levels are mapped to a finite range, such as [0, 255] for an individual color channel under 24-bit color. The quantization process itself, including its physical control, is considered to be outside the scope of the system. It is assumed that any implementation will have a readily available quantization process and accompanying software specified for the hardware chosen. The purpose of the quantization block is to encapsulate this functionality; the output of the quantization block is an X x Y array of raw pixel-value vectors, where X x Y is the native resolution of the imaging device. If all devices in the system are homogeneous, these blocks may be virtualized, especially if black-box drivers or controller software is needed. Proper encapsulation should be maintained, however. The quantization process itself will occur at a specified rate, typically determined by the capabilities of the sensor hardware. The system designer may choose either synchronized, clock-based sampling hardware or asynchronous hardware for image capture. The remaining stages before the synchronization block are designed to operate asynchronously, to allow for either type of hardware to be used. It should be noted that, regardless of the type of camera used, the effective average capture or frame rate must be sufficient to prevent aliasing, ghosting, and over-use of extrapolation during the synchronization step. It is assumed that cameras deliver quantized images at their maximum rate continuously. As mentioned above, at each quantization instant, the output of the quantization block is the raw image, I, given by Equation (3.4):

I = [ c_{x,y} ], x = 1 to X, y = 1 to Y    (3.4)

The raw image is given for an image of size X x Y pixels; the pixel is the atomic element of a raw image. All subsequent equations will omit the vector notation from any pixels unless explicitly required. The raw images are passed to the pre-processing filter blocks.

Pre-Processing Filter Blocks

The purpose of the pre-processing filter block is to implement per-pixel filtering on input data. Per-pixel filtering is used for a number of purposes. The primary purpose is image correction, or the removal of artifacts introduced by real-world noise, variations, and imperfections in both the environment and the quantization process itself. The exact selection of filters used will strongly depend on the application and the camera hardware being used. Several common filters and their reference implementations are discussed in Appendix A.

Interest Filter Blocks

The interest filter blocks are the last per-pixel filter blocks before the alignment stage. The goal of this block is to generate a per-pixel interest map of the image [28]. As identified in Chapter 2, even un-occluded views of the OoI may contain significant redundant or useless information. It is beneficial to remove this information early in the pipeline, as it reduces computation and transmission costs for all future stages, and it increases the Signal-to-Noise Ratio (SNR) of the overall data set. Region-of-Interest identification, or Interest Filtering [33], is a method to identify regions of potential interest in a raw image by applying a series of low-level filters and thresholding/compositing the results. Although initially applied purely to OoI detection problems, these filters can be used in this application to selectively remove portions of the image that are highly unlikely to be needed by the rest of the pipeline. To implement interest filtering, the image is pre-segmented into areas of interest using a series of customizable Interest Filters, similar to those in [27]. In this framework, the interest I_{x,y} of pixel (x, y) is:

I_{x,y} = 1 if max_k(F_{k,x,y}) ≥ I_min, and 0 otherwise    (3.5)

In Equation (3.5), F_{k,x,y} is the response of the k-th input filter (a total of K filters, with F_{k,x,y} in [0, 1]), and I_min is a user-defined minimum level of interest. The logical-or effect of the maximum-value operator selects all regions that could be of interest under any of the chosen filters. Individual filters may be implemented using a variety of methods, and are highly application specific. Appendix B provides a non-exhaustive list of some common interest filters and their implementation details. In addition, Chapter 6 provides details of all interest filters used in the experimental setup, although these should not be considered a reference set; they are completely customized to one specific object class. If the system has sufficient processing resources, or a suitable set of filters cannot be determined by the system designer, this stage can be omitted, at the cost of additional downstream overhead. After interest filtering, the regions with zero interest must be removed before transmission of images to Stage L2. The raw image is first weighted by the calculated interest levels:

c'_{x,y,ch} = I_{x,y} * c_{x,y,ch}    (3.6)

In Equation (3.6), c_{x,y,ch} is the raw pixel value of color channel ch at pixel (x, y), and c'_{x,y,ch} is the output image pixel. This weighting has the effect of zeroing all regions of the image with interest below the user-defined minimum of I_min. Large contiguous regions of single-color pixels are easily compressed or removed through compression algorithms in the Synchronization Block. After interest filtering, the system may include a check on all remaining per-pixel operations to exclude all zeroed pixels from calculations. In this manner, both transmission bandwidth and processing time are reduced.

Alignment Blocks

After all pre-processing operations have completed, the resultant image must be aligned and corrected to a uniform coordinate system, normalized camera coordinates. For most camera models, this will involve a distortion-correction step and a projection step. This block is responsible for removing lens distortion, as part of a more complete calibration model. It may be omitted if a lens distortion model is not used as part of camera calibration. In general, each pixel, with initial pixel coordinates $(i,j)$, is assumed to be in a distorted location, as determined by the lens of the camera. Given a calibrated transform, these distorted coordinates are pushed through the calibrated lens transform to yield true pixel coordinates, $(i',j')$, for each input pixel. These new pixel coordinates are initially recorded to sub-pixel accuracy to allow the system to apply anti-aliasing techniques.

First, the pixel set is centered and cropped back to the initial image size. The new set of pixel locations will rarely span all pixels in this window: some will be blank, and others may map to multiple input pixels. A combination of interpolation and anti-aliasing techniques must be applied to construct the complete image. OpenGL provides many tools designed for exactly this type of problem, and thus the reference implementation presented in Chapter 6 makes extensive use of these tools. Other, more general methods can also be applied, such as [139]. This is a common computer-vision task, and little more needs to be elaborated here. As part of the reference method chosen for camera calibration, the experimental setup implements a distortion model based on the Plumb-Bob (Brown-Conrady) model [140], and an overall method based on the CalTech Camera Calibration toolbox (a reference is not available for this toolbox, but the method it implements is based on [141] and [142]). This method is part of the complete system-calibration method developed as part of the experiments in Chapter 5. As such, complete details of this block will be presented as part of the discussion of that calibration method. In general, the choice of lens model will depend entirely on the camera: more expensive lenses typically require only low-order distortion models, while less expensive lenses, such as webcam lenses, will require significant distortion correction for best results. Many options at all levels of distortion correction exist, so it is impossible to review all possibilities; however, the selected reference method has an adjustable order of distortion correction, and thus should be applicable to the majority of cases.

Synchronization Block

Time Stamping - The synchronization block is responsible for assigning world-clock timestamps to all images produced by the imaging devices. This is achieved through a global world time scale, to which all system clocks are synchronized. To begin, it is assumed that the hardware for each imaging device maintains its own system clock, and that this clock is available in some standard format. For example, later experiments use a free-running high-resolution counter. Images are captured by each camera and stamped with the local system time. The synchronization block hardware has its own internal clock. It is assumed that all clocks are stable and do not exhibit significant drift. A clock synchronization scheme, such as [143], is used to produce an estimate of the difference, $\Delta t_n$, between the synchronization block's current time and that of each imaging device, $t_n$, $n = 1, \ldots, N_{cam}$, where $N_{cam}$ is the number of cameras in the system. Thus, the world times for each image are given by:

$t_{world,n} = t_n + \Delta t_n$   (3.7)

In Equation (3.7), the world time, $t_{world,n}$, is simply the image's local timestamp, $t_n$, plus the time differential between that camera and the synchronization block hardware, $\Delta t_n$. If the hardware for all of Stage L1 is shared and imaging occurs simultaneously, this step can generally be omitted, as $\Delta t_n$ is essentially zero. After appropriate world-time stamps are attached to all images, the data must be compressed and buffered for transmission to Stage L2.

Compression

After all image correction, interest filtering, and alignment operations have completed, the final step is to compress the image before buffering and transmission to Stage L2. This step is arguably one of the most critical in this agent for real-time operation; the bandwidth and overhead savings are passed through the entire pipeline. Similarly, the cost of the compression and decompression algorithm must be considered as a trade-off against these savings. Image compression itself is a common task in computer vision and graphics, and as such it is impossible to review all possible techniques in this dissertation. A non-exhaustive list of potential methods, including benefits and drawbacks, can be found in Appendix C. In general, the method must meet a few key criteria. First, the compression must be lossless, or must guarantee that any compression artifacts do not interfere with the chosen vision methods implemented in Stage L3. Ideally, the method should also maintain a good compression ratio: due to interest filtering, there will be significant contiguous regions of a single color, and certain algorithms can exploit this fact to significantly increase compression. Finally, the algorithm must operate in real-time for both compression and decompression. This is often the most difficult criterion for these algorithms to satisfy, especially for compression. After compression, images must be buffered before transmission.

Buffering and Transmission

The final step in this block is image buffering and transmission. As will be explained in detail in Section (3.3.2), this agent must buffer the asynchronously-generated images in preparation for a synchronization process. Stage L2 will later request pairs of images with timestamps that satisfy certain requirements, to ensure that synchronization of all sensors is possible. As such, a rolling buffer of images must be maintained. Images are added to this buffer immediately upon generation, and are removed once their timestamp is older than two demand instants from the current instant. While the depth of this buffer could be reduced by careful selection of which images to drop, storage is generally not considered costly in most implementations, while the added flexibility is valuable to the system designer. Once sufficient pairs of images are transferred to Stage L2, the pipeline operation continues with a synchronization process.
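The world-time stamping of Equation (3.7) and the rolling buffer with its bracketing-pair query can be sketched as follows. This is a minimal illustration under assumed names and data layout; the actual buffer, payload format, and query interface of the reference implementation may differ.

```python
import bisect
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class StampedImage:
    t_world: float      # local timestamp + estimated clock offset, Eq. (3.7)
    data: bytes         # compressed, interest-filtered image payload

@dataclass
class RollingBuffer:
    """Per-sensor rolling buffer on the Stage L1 output side (illustrative structure)."""
    images: List[StampedImage] = field(default_factory=list)   # kept sorted by t_world

    def add(self, t_local: float, clock_offset: float, data: bytes) -> None:
        self.images.append(StampedImage(t_world=t_local + clock_offset, data=data))
        self.images.sort(key=lambda im: im.t_world)

    def prune(self, t_now: float, demand_period: float) -> None:
        # Drop images whose timestamp is older than two demand instants.
        cutoff = t_now - 2.0 * demand_period
        self.images = [im for im in self.images if im.t_world >= cutoff]

    def bracketing_pair(self, t_d: float) -> Optional[Tuple[StampedImage, StampedImage]]:
        """Return a pair (I1, I2) with t1 <= t_d <= t2, as required by Stage L2, or None."""
        times = [im.t_world for im in self.images]
        k = bisect.bisect_left(times, t_d)
        if k == 0 or k == len(times):
            return None
        return self.images[k - 1], self.images[k]
```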

Stage L2 Synchronization Agent

As part of a complete solution to capture and synchronize images taken from multiple imaging devices, world time stamps were applied to all images in Stage L1. However, there is no guarantee that all sensors will produce an image at a single time. As such, an alignment step, as implemented by the Synchronization Agent, is required to increase image coherency and achieve the best possible tracking results. The Synchronization Agent uses the synchronized global clock to select a single world time, $t_d$, to correspond to the current demand instant. For the $n$-th sensor, Stage L1 is asked to transmit two time-stamped and interest-filtered images, $I_{n,1}$ and $I_{n,2}$, which have corresponding world time-stamps $t_{n,1}$ and $t_{n,2}$. These images are selected such that the following constraint is satisfied:

$t_{n,1} \le t_d \le t_{n,2}$   (3.8)

Once these images have been received, they are first decompressed using the method matching the compression method used in Stage L1. If, for any sensor, a pair of images which satisfies this constraint does not exist, this stage has two options: (i) wait for additional images, or (ii) use extrapolation. If the first option is used, the stage waits for a fixed delay, modifies $t_d$ appropriately, and repeats the query. This process can be costly, however, as the new $t_d$ may fall outside the range of other sensors' image pairs, which would then also have to be re-transmitted. If these sensors do not have new images, there is a danger that the agent would have to wait indefinitely, especially when there are many sensors in the system. As such, the agent defines an internal waiting deadline, after which all available sets of image pairs are examined. The selection of $t_d$ does not necessarily have to correspond with the current world time, only with a time uniformly represented across all images. As such, the system selects the set of image pairs which would require the least use of extrapolation. Once images are received and a final value of $t_d$ is selected, image-wide, per-pixel interpolation/extrapolation is applied to construct an image of synthetic pixel values, $p_{syn}(i,j)$, that occurs exactly at $t_d$:

$p_{syn}(i,j) = p_{n,1}(i,j) + \dfrac{t_d - t_{n,1}}{t_{n,2} - t_{n,1}}\left[\, p_{n,2}(i,j) - p_{n,1}(i,j) \,\right]$   (3.9)

This interpolation/extrapolation assumes that the time offsets from $t_d$ are relatively small; otherwise, ghosting of edges will occur [144]. Overall, image extrapolation should be considered less accurate than interpolation in this application. By moving this synchronization functionality to a distinct pipeline stage, load balancing is achieved. The extra time available to the agent can be applied to allow additional images to be captured, improving the chance of finding a value of $t_d$ which satisfies Equation (3.8) for all sensors.
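A minimal sketch of the per-pixel synthesis step of Equation (3.9) is given below, assuming the two bracketing images are stored as floating-point arrays of identical shape; the blending used in the reference implementation may differ in detail.

```python
import numpy as np

def synthesize_at(t_d, img1, t1, img2, t2):
    """Per-pixel linear interpolation/extrapolation of two time-stamped images to the
    demand instant t_d, in the spirit of Eq. (3.9).  img1 and img2 are float arrays of
    the same shape; alpha outside [0, 1] corresponds to extrapolation."""
    if t2 == t1:
        return img1.copy()
    alpha = (t_d - t1) / (t2 - t1)
    return img1 + alpha * (img2 - img1)

# Usage: two frames 40 ms apart, synthesized at a demand instant 10 ms after the first.
a = np.random.rand(480, 640, 3)
b = np.random.rand(480, 640, 3)
synthetic = synthesize_at(t_d=0.010, img1=a, t1=0.000, img2=b, t2=0.040)
```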

Once the synthetic image is constructed for all sensors, the results are re-compressed using the same data-compression algorithm as in Stage L1, and are transmitted to Stage L3 on the clock transition.

Stage L3 Point Tracking Agents

The Point Tracking Agents generate 2-D locations in pixel coordinates for all points of interest visible in each sensor's filtered, aligned, and synchronized image. These points of interest correspond to critical model points (typically, joint locations), allowing the system to eventually reconstruct an estimate of the current OoI form. In general, since the framework may potentially be adapted to a variety of articulated TVG objects, a distinct point-tracking stage provides the most flexibility. The system designer is free to select a set of algorithms which best suits the object at hand. Thus, from the de-compressed input images received from Stage L2, this stage must determine the 2-D pixel locations of all feature points that are visible to the system sensors. These 2-D locations form constraints in Stage L4, and are used to recover 3-D positions and the OoI 6-dof pose. Moving obstacles must also be detected and tracked as part of this process. Static obstacle locations are assumed to be a priori known, as part of system setup. Thus, feature points can also be added for any desired obstacles in order to unify the tracking process. To achieve 2-D tracking, two distinct tasks are necessary: (i) point detection, and (ii) point tracking. For point detection, any visible feature points that are not currently being tracked must be detected. Each feature point must be uniquely detected, but the mapping of feature points to model joints may be delayed until Stage L5. In general, only a small subset of feature points needs to be uniquely identified, in order to recover an estimate of the OoI pose and remove rotation and scaling model effects. In some cases, all points may be initially unidentified, if another method of de-rotation and scaling is available. The method of feature-point detection will be inherently application-specific, but a summary of some common detection methods and their applications is presented in Appendix D. Note that many of the methods examined in this Appendix form localized descriptors (i.e., edge-based, color-based, etc.). Many of these descriptors can be combined into larger descriptor vectors to better describe (and, potentially, uniquely identify) a particular feature of interest. Statistical methods such as PCA or LDA can also be applied to reduce the feature database size and the corresponding search times. Any points that are detected must also be tracked; updated world-coordinate estimates of feature positions are required at each demand instant. Tracking is performed in 2-D, as with detection, allowing the system to form constraints and determine 3-D world coordinates in Stage L4 and Stage L5.

Methods of 2-D tracking are numerous, so a summary of common methods is presented in Appendix E. For any positively tracked feature, when a tracking loss occurs, the feature is marked as undetected. The chosen detection method is then applied to re-acquire a positive track. However, the search area is limited to a circular region centered at the last known pixel location of the feature. This circle has a radius of $r$ pixels:

$r = \bar{d} + 2\sigma_d$   (3.10)

In Equation (3.10), $\bar{d}$ is the average frame-to-frame pixel displacement of the feature over the last $w$ frames, and $\sigma_d$ is the standard deviation of that displacement. The value $w$ is a user-selected variable controlling the length of the window. This equation simply assumes that the frame-to-frame change in pixel coordinates is approximately normally distributed, in which case there would be approximately a 95% chance that the feature point lies within the area searched. If the feature remains undetected after $w$ frames, the search area reverts to the entire image. This accounts for temporary tracking loss while minimizing searching (a computationally costly operation). As shown in Figure 3.2, the pipeline splits into a parallel pipeline, and the level of parallelism is selected to match the number of sensors. In general, load balancing across the parallel pipes must always be considered. In this respect, it is important to note that most methods of detection (Appendix D) are significantly more costly than the corresponding tracking methods (Appendix E). Thus, while this stage could be split into two separate stages (detection and tracking), the detection stage would still require significantly more time; no significant gain would be made, and pipeline depth would be increased. A simpler solution is to make detection (excluding the limited local search mentioned above) an asynchronous process. As such, this stage implements cycle stealing with Stage L4. Detection normally completes in one cycle, and Stage L3 operates as normal. If the detection step runs long, all completely detected and tracked points are passed to Stage L4, but a portion of the Stage L3 hardware will finish the previous detection step. Results are passed through a forwarding path directly to Stage L5. If the detector is already busy, it will ignore new raw images clocked in to Stage L3. In this manner, the system can borrow a cycle from Stage L4, thus preferring a complete search of one image over partial searches of two images. The use of a superscalar architecture allows tracked points for other sensors to be unimpeded by this process.
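The bounded re-detection window of Equation (3.10) can be sketched as follows. The class name, window handling, and loss counter are illustrative assumptions, not the reference tracker.

```python
import math
from collections import deque

class FeatureTrack:
    """Windowed displacement history for one feature, used to compute the bounded
    re-detection radius of Eq. (3.10)."""

    def __init__(self, window_len=10):
        self.displacements = deque(maxlen=window_len)  # last w frame-to-frame motions (pixels)
        self.last_pos = None
        self.frames_lost = 0

    def update(self, pos):
        """Record a successful track at pixel position pos = (u, v)."""
        if self.last_pos is not None:
            du, dv = pos[0] - self.last_pos[0], pos[1] - self.last_pos[1]
            self.displacements.append(math.hypot(du, dv))
        self.last_pos = pos
        self.frames_lost = 0

    def mark_lost(self):
        self.frames_lost += 1

    def search_region(self):
        """Circular region (centre, radius) for re-detection, or None to search the full image."""
        if (not self.displacements or self.last_pos is None
                or self.frames_lost > self.displacements.maxlen):
            return None
        n = len(self.displacements)
        mean = sum(self.displacements) / n
        var = sum((d - mean) ** 2 for d in self.displacements) / max(n - 1, 1)
        radius = mean + 2.0 * math.sqrt(var)          # r = d_bar + 2 * sigma_d
        return self.last_pos, radius
```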

Selection Requirements

As mentioned above, the specific methods of feature-point detection and tracking are purposely left unspecified; the system designer would select those best suited to the particular OoI being sensed. For example, later experiments in Chapters 5 and 6 use a combination of PCA-based image search and a modified optical-flow method for tracking a robotic human analogue (similar to [145]). While sufficient for this application, these were found to require significant set-up effort when tracking real humans. In a real-world scenario, the system designer would simply select more appropriate algorithms for this case, based on an iterative design process. In general, any chosen method of 2-D feature detection or tracking must (i) be view-invariant, or robust to viewpoint changes, (ii) be able to detect individual feature points on a per-frame basis and update their tracks while the feature remains visible, and (iii) be real-time applicable. One may note that the second requirement is sometimes difficult to satisfy: many methods inherently assume static cameras, so relative motion which is not caused solely by object motion may not be directly addressed. As will be examined in Chapter 7, this is an area for potential future work. Lastly, methods which (i) require little or no a priori information, (ii) minimize set-up effort and database size/complexity, (iii) minimize search time, (iv) uniquely identify the feature point, (v) maximize individual success rate in detection, and (vi) minimize overall implementation complexity, are preferred.

Stage L4 De-projection Agents

Stage L4 takes detected feature locations in pixel coordinates, $(u, v)$, and performs de-projection to form world-coordinate constraints. It is assumed that these coordinates have been filtered to remove lens distortion in Stage L1, and that they are determined from synchronized image sets. The goals of this stage are (i) to recover the normalized camera coordinates of each detected feature point, $[x/z \;\; y/z \;\; 1]^T$, and (ii) to use these coordinates to form constraints in world coordinates, which Stage L5 can solve. The first task is typically performed using a simple inversion of the calibrated camera transforms, unless an unusual camera model is used. For example, in Chapters 5 and 6, pixel coordinates are related directly to normalized camera coordinates:

$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} x/z \\ y/z \\ 1 \end{bmatrix}$   (3.11)

In Equation (3.11), $K$ is the intrinsic camera calibration matrix, as determined during camera and system calibration. For the chosen reference model [142], this matrix can be directly inverted to yield normalized camera coordinates from pixel coordinates, and vice-versa.
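For illustration, the inversion of Equation (3.11) can be sketched as below; the intrinsic parameters shown are placeholders, not calibrated values.

```python
import numpy as np

def pixel_to_normalized(u, v, K):
    """Invert the intrinsic matrix K to map an (undistorted) pixel coordinate to
    normalized camera coordinates [x/z, y/z, 1]^T, per Eq. (3.11)."""
    pixel_h = np.array([u, v, 1.0])
    ray = np.linalg.inv(K) @ pixel_h
    return ray / ray[2]          # enforce the homogeneous scale of 1

# Placeholder intrinsics: focal lengths fx, fy and principal point cx, cy are assumptions.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

n = pixel_to_normalized(400.0, 260.0, K)   # -> array([0.1, 0.025, 1.0])
```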

Details of this matrix and transform will be presented in Chapter 5, wherein complete system calibration will be examined. The de-projection step uses the normalized camera coordinates to form constraints in world coordinates. For simple lens models, each detected point forms a ray passing through a corresponding point on the camera projection plane in world-coordinate space. However, under more complex models, such as the distortion model in the reference implementation [140], lens distortion and extrinsic camera parameters must be accounted for. The result is a non-linear constraint which is expressed in parametric form:

$\vec{c}(s) = \left[\, x(s) \;\; y(s) \;\; z(s) \,\right]^T$   (3.12)

In Equation (3.12), the constraint curve, $\vec{c}(s)$, is given by parametric positions in world coordinates, $x(s)$, $y(s)$, and $z(s)$, based on a single parameter, $s$. The actual form of the parametric equation will depend on the chosen system and camera models. Even in an ideal situation, these parametric curves may be decidedly non-linear, and can form complex solution sets, as there may be multiple partial intersections between different sets of curves. In the real world, the problem is more complex; these curves often present intersections, points of tangency, and points of close approach. In effect, they no longer define a simple intersection, but rather an area (or multiple areas) of likelihood within which the true intersection is expected to exist. As such, this complex problem necessitates a separate pipeline stage, Stage L5, whose goal is to recover the best estimate of the true world coordinates of the feature point represented by the above constraints.

Stage L5 3D Solver Agent

This stage is responsible for determining an estimate of each feature point's 3-D world-coordinate position, $\vec{p}_i = [x_i \;\; y_i \;\; z_i]^T$, from the constraints created in Stage L4. Points with an insufficient constraint set are ignored. Thus, this problem will often be over-constrained (the minimum of two cameras viewing a point provides four unique constraints and three unknowns under a linear projection model). If the calibration models have a linear closed-form inverse, then a direct optimization-based solution can often be used [146]. Typically, however, the constraints will form a complex set of 3-D surfaces, with multiple intersection points or even edges. The global solution must be taken as the solution of the optimization problem:

$(x_i, y_i, z_i) = \underset{(x,\,y,\,z)}{\arg\min} \; \sum_{k=1}^{K_c} w_k \left[ (x - x_k)^2 + (y - y_k)^2 + (z - z_k)^2 \right]$   (3.13)

In the optimization of Equation (3.13), $x_k$, $y_k$, and $z_k$ are the coordinates of the closest point to $(x, y, z)$ that satisfies constraint $k$, with $K_c$ constraints in total. $w_k$ is a weighting factor, which is either a constant 1, or a weight from 0 to 1 based on confidence in that constraint. This optimization is subject to multiple local minima due to the intersection of constraint surfaces. The actual optimization complexity will strongly depend on the models chosen; thus, it is not possible to recommend a general reference optimization method.
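As a hedged illustration of the optimization in Equation (3.13), the sketch below treats each constraint as a straight 3-D ray (i.e., a simple lens model), for which the weighted least-squares point has a closed form. The full framework must instead handle the general parametric curves of Equation (3.12); the function and variable names here are assumptions.

```python
import numpy as np

def triangulate_from_rays(origins, directions, weights=None):
    """Weighted least-squares point closest to a set of 3-D rays.

    Under a simple (linear) lens model, each camera constraint is a ray o_k + s*d_k;
    the closest point on ray k to a candidate p plays the role of (x_k, y_k, z_k) in
    Eq. (3.13).  For rays, the minimization reduces to the linear system
        sum_k w_k (I - d_k d_k^T) p = sum_k w_k (I - d_k d_k^T) o_k
    """
    origins = np.asarray(origins, dtype=float)
    directions = np.asarray(directions, dtype=float)
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    if weights is None:
        weights = np.ones(len(origins))

    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d, w in zip(origins, directions, weights):
        P = np.eye(3) - np.outer(d, d)     # projector orthogonal to the ray direction
        A += w * P
        b += w * (P @ o)
    return np.linalg.solve(A, b)

# Two hypothetical camera rays observing the same feature; both pass through (1, 1, 2).
p = triangulate_from_rays(
    origins=[[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    directions=[[1.0, 1.0, 2.0], [-1.0, 1.0, 2.0]],
)   # -> approximately [1.0, 1.0, 2.0]
```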

The output from this stage is a cloud of world-coordinate feature points, possibly with associated identities, if Stage L3 is able to uniquely identify feature points. These are passed to Stage L6 to produce an estimate of the current OoI form. It is also important to note that alternate methods of 3-D location can be used with little modification to the previous stages. Off-axis rotation can be used to provide depth constraints from the calibrated cameras, or stereo-vision pairs can be used (provided they are determined to be necessary during the off-line calibration step). Aside from adjusting load balancing between Stages L1-L5 as necessary, such methods can be used directly in this framework with no other modification. The above triangulation-based method does, however, tend to provide better results in most multi-camera cases, due to the large number of physically separate cameras sensing a single target. As an additional benefit, the method also tends to be simpler to implement.

Stage L6 Form Recovery Agent

As described in Chapter 2, Stage L6 contains the Form Recovery Agent, the goal of which is to recover an estimate of the current OoI form, regardless of the current OoI action. If complete information about the OoI form is not available from the current set of sensor views, the agent must attempt to fill in missing information whenever possible. In addition, contextual information about the OoI, including model constraints and other a priori knowledge, may be used to improve the quality of the estimated form. The input to the form-recovery stage is a cloud of detected feature-point locations, given by:

$P = \left\{ \vec{p}_i = \left[\, x_i \;\; y_i \;\; z_i \,\right]^T, \quad i = 1, \ldots, N_p \right\}$   (3.14)

In Equation (3.14), the feature-point location in world coordinates, $\vec{p}_i$, is given by the 3-D world coordinates, $x_i$, $y_i$, and $z_i$, found by Stage L5. Feature points which were not detected in the current set of views are omitted from this cloud.
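A minimal sketch of the per-instant point-cloud payload of Equation (3.14), including the optional identity noted above, might look as follows; the field names are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class WorldFeature:
    """One entry of the point cloud of Eq. (3.14) handed from Stage L5 to Stage L6."""
    x: float
    y: float
    z: float
    identity: Optional[str] = None   # set only if Stage L3 uniquely identified the feature
    confidence: float = 1.0          # optional solver confidence (assumed field)

# A demand-instant payload: undetected features are simply absent from the list.
PointCloud = List[WorldFeature]

cloud: PointCloud = [
    WorldFeature(0.12, 1.45, 2.03, identity="neck"),
    WorldFeature(0.31, 1.22, 2.10),          # detected, but not yet identified
]
```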

Note that even if the vision method used in Stage L3 is able to uniquely identify all feature points in the environment, this information should not be used directly to produce the estimated form feature vector; it will be incorporated later in the process. For now, it is assumed that only four reference feature points, $\vec{r}_1, \ldots, \vec{r}_4$, are always uniquely identifiable. All other detected feature points, $\vec{p}_i$, do not need to be uniquely identified. Action-library poses consist of $N_f$ total points. While static reference points may seem inapplicable for TVG objects, under the definition of an articulated object outlined in Chapter 2, many articulated TVG objects possess some static-geometry sub-parts. Indeed, most real objects in practice have multiple stationary points between actions. For example, the experiments in Chapter 6 using humans take the center of the face, neck, chest, and shoulders as reference points. These features can be readily identified, and exhibit little motion for the actions to be recognized. If stationary reference points are not available for the OoI, it may still be possible to uniquely identify a subset of regular feature points. The iterative process that follows would then have to include an intelligent hypothesis tester, which first selects a subset of the detected points as $\vec{r}_1, \ldots, \vec{r}_4$. The process would then determine a model fit, including scaling and rotation constants, which would be verified. If the form estimate is not feasible, a new subset of points would then be selected. This process is inherently more costly, and should be avoided in practice. A flow chart representing the iterative form-recovery process is given in the following figure, Figure 3.4:

FIGURE 3.4 ITERATIVE FORM RECOVERY PROCESS FLOWCHART

This is a multi-level iterative process which first removes translation, scaling, and rotation effects, before repeatedly combining model constraints and other a priori information to produce an estimate of the subject form. The detailed process is as follows:

Step 0 If reference points are not available, select a subset of the detected points as $\vec{r}_1, \ldots, \vec{r}_4$, temporarily removing them from the set of all points to be fitted. This selection must be performed by an intelligent hypothesis tester, such that these points are representative of the object center, $\vec{r}_1$, and the principal axes, $\vec{r}_2, \ldots, \vec{r}_4$. If suitable points are not available in the point cloud, the hypothesis tester must instead provide an estimate of the rotation and scaling matrix, $R$ (see Step 2), which it must later verify through comparison to the action library. In general, this process is complex and costly, and typically can be avoided through careful examination of the object for stationary points. The vast majority of articulated TVG objects will have such points, although deformable objects typically will not. The other alternative is to use a rotation- and scaling-invariant method of action recognition in Stage L9, although this will increase the implementation complexity significantly. For the purpose of the reference implementation, it is assumed that reference points are available, either through direct detection or from an intelligent hypothesis tester.

Step 1 Taking $\vec{r}_1$ as the new origin, remove the translation effect from all points. This origin point will typically be selected as the object center-of-mass (CoM) for most objects. In cases where the CoM may lie in a non-stationary sub-part of the object, another point can be used. All actions are recorded relative to this point, so that the recognition process is independent of the world position of the OoI. The equations to remove the translation are as follows:

$\vec{p}^{\,\prime}_i = \vec{p}_i - \vec{r}_1$   (3.15)

$\vec{r}^{\,\prime}_j = \vec{r}_j - \vec{r}_1, \quad j = 1, \ldots, 4$   (3.16)

In Equation (3.15), $\vec{p}^{\,\prime}_i$ is the new object-coordinate position of the input feature point, $\vec{p}_i$. Similarly, in Equation (3.16), $\vec{r}^{\,\prime}_1, \ldots, \vec{r}^{\,\prime}_4$ are the new reference points corresponding to $\vec{r}_1, \ldots, \vec{r}_4$, with world translation effects removed.

Step 2 Following translation removal, the general 3-D rotation and scaling matrix, $R$, must be recovered. This matrix defines an object-coordinate rotation and scaling effect which transforms the incoming point cloud to a pre-defined orientation and scale. The matrix can be determined through the following optimization:

$R = \underset{R}{\arg\min} \; \sum_{j=1}^{4} \left\| R\,\vec{r}^{\,\prime}_j - \vec{r}^{\,0}_j \right\|^2$   (3.17)

In Equation (3.17), the matrix $R$ is determined through an argument minimization, where $\vec{r}^{\,0}_j$ is the default position (i.e., an un-rotated location, given a priori by the model library) for the corresponding object-coordinate reference point, $\vec{r}^{\,\prime}_j$. Essentially, the goal of the minimization is to find a rotation matrix which minimizes the distances between the rotated reference points and their expected locations.

Step 3 After the rotation matrix has been recovered, it can be applied to all other points in the point cloud. This is a simple matrix multiplication:

$\vec{p}^{\,\prime\prime}_i = R\, \vec{p}^{\,\prime}_i$   (3.18)

In Equation (3.18), the de-rotated point, $\vec{p}^{\,\prime\prime}_i$, is found through direct multiplication by the rotation matrix, $R$. It should be noted that, if available, more than three reference points can be used to estimate $R$ in Steps 1 and 2; three is only the minimum needed.

Step 4 It is assumed that, as part of off-line OoI analysis, a number of model constraints will be found. For example, in an object with rigid segments, such as limbs for humans, these will be rigid-segment length constraints. Given a rigid segment with multiple joints, the total distance between these joints is fixed, regardless of orientation. Any similar constraints, so long as they are form-invariant, may be used. Given a total of $M$ model constraints, a measure of constraint quality is determined for each combination of constraint and feature point:

$q_{m,i} = \left\| \vec{p}^{\,\prime\prime}_i - \vec{c}_{m,i} \right\|$   (3.19)

In Equation (3.19), $q_{m,i}$ is the metric of constraint quality, and $\vec{c}_{m,i}$ is the closest point to $\vec{p}^{\,\prime\prime}_i$ which satisfies constraint $m$, where $1 \le m \le M$. It should be noted that most constraints will depend on the assignment of feature points. If this value, $q_{m,i}$, cannot be calculated for a given constraint because the required locations have not yet been identified, the calculation is omitted. All metric values which can be calculated are examined to determine the best matches. For a given detected location $\vec{p}^{\,\prime\prime}_i$ in the feature point cloud, if $q_{m,i} < q_{min}$, where $q_{min}$ is a minimum constant, the corresponding location $\vec{c}_{m,i}$ is added to a list for that detected point. If multiple values of $\vec{c}_{m,i}$ are in the list after all constraints are examined, the $\vec{c}_{m,i}$ with the lowest associated $q_{m,i}$ is chosen as the final location. This final location is also assigned an identity corresponding to that required by the model constraint which produced $\vec{c}_{m,i}$. In this manner, this step uses model constraints to (i) determine an initial identity for feature points which satisfies the model constraints, and (ii) adjust detected feature-point locations to satisfy the model constraints as well. Any points which are not assigned an identity in this step are listed as unassigned. The difference between the available points and the number of assigned/unassigned points gives the number of missing points. If, in later iterations, there are constraints which can be

96 71 evaluated, but no points in the feature point cloud satisfy the constraint, the associated points (including all subsequent linked points in the model constraint tree) must be labeled missing as well. Step 5 Once all possible constraints are included, and the numbers of currently missing and unassigned points are determined, a metric of the total uncertainty in the current estimate of the form,, is calculated: (3.20) In Equation (3.20), is the number of missing points, is the number of points still unassigned, and is the number of assigned points. The values of,, and are proportionality and weighting constants. The point is the interpolated point with lowest Euclidian distance to in the current action. If there is no current action,, where is the predicted location estimated by Stage L7 for this instant. In this manner, knowledge of the current action is used whenever possible to give the most accurate estimate of the uncertainty in the estimated form. However, this process still produces only an estimate of uncertainty, and must be treated as such. Step 6 Due to the hierarchical nature of the model constraints (typical for articulated, skeletonbased TVG object models), the agent must iterate to include the next level of model constraints. To do so, increase by an increment and repeat Steps 4 and 5 until either 0 or increases from the previous iteration. If the latter case occurs, all remaining unassigned points are considered missing. With each iteration, any previously identified/assigned points remain identified and assigned, allowing subsequent levels of the constraint tree to be evaluated. Step 7 To attempt to fill in missing points in the model feature vector, the agent replaces all missing feature points in the model with the predicted location,, as estimated by Stage L7. This estimate is performed in the immediately previous instant, and fed-back to this stage on the clock transition. The agent now repeats Steps 4 to 6 to once again apply model constraints and find any remaining unassigned points. Step 8 Finally, with a completed estimate of the form feature vector, a final value of can be calculated (as in step 5). The final set of points,, are composed into a potential OoI feature vector and saved in a list structure. If, at any iteration in the previous steps, there were multiple points which satisfied, the agent will repeat Steps 4 to 7 for all permutations of point assignments. Even in this process, the fit may get stuck in a local minimum. If this occurs often, the system designer can include a relaxation operator to selectively relax model constraints based on other a priori or online

97 72 knowledge [147]. This will change the seed mapping in Step 4, allowing Steps 4 to 7 to be repeated, testing more permutations of feature assignment. Once all desired permutations are tested, or an internal deadline is reached, the feature vector with the lowest value of is selected as the final form estimate for this instant. One may note that if feature points can be uniquely identified, this method should still be used. This method iteratively fits a best estimate of the current subject form, favoring model constraints or predicted pose, depending on user parameters and the hypothesis selector. It uses all possible information in this estimation, increasingly the likelihood of a positive form match over a blind recovery of points. As such, even if the mapping of feature points is a priori known, the method can still produce a better estimate of current form by filling in missing points using predicted information, and by using model constraints to reduce the effect of outliers Stage L7 Prediction Agent Stage L7 contains the Prediction Agent of the framework, which is responsible for predicting both future object poses and OoI form feature vectors. These estimates are a critical part of the data used by Stage L8 to make future pose decisions. As specified in Chapter 2, all objects in the environment must be positively tracked. Part of this process should be to add all observations of the detected object positions to a predictive filter. As will be shown in later experiments, estimates object poses inherently have additive noise due to estimation uncertainty. The goal of the predictive filter is to reproduce the true object motion path. A number of developed solutions exist, which must be selected by the system designer based on the expected object motion. Typically, this motion will be characterized as part of the off-line calibration process. If only short-term prediction is needed, as is the case for some quantities in the experimental setup presented in Chapter 6, a simple windowed linear regression method can be applied. However, most real-world objects exhibit complex paths with significant acceleration. For a general purpose reference method, this work recommends a Kalman filter (KF) with second-order state variables (i.e., position, velocity, acceleration): (3.21) In Equation (3.21), the position, velocity, and acceleration of the object are given in state space by, which contains the object s 3-D position,, velocity,, and acceleration. Velocity and acceleration are the first and second derivatives of position, respectively. As will be shown in Chapter 4, this method was the simplest to implement of the available methods which also provided suitably accurate

98 73 predictions. For all predictive filters, uncertainty inherently increases when predicting farther into the future. For object paths which can inherently be represented by second-order motion paths, this filter is ideally suited, and uncertainty is not significant, even when predictive relatively far into the future. For objects with more complex paths, it was found that short-term predictions will typically still be within the operational bounds of the method. Thus, if the system update rate is kept sufficiently high, this filter can often still be used, even for more complex motion paths. The window length of the KF will also affect its applicability in this situation; it was found that a window length equal to the pipeline depth produced the best average case results. As mentioned above, it is up to the system designer to select a method that is suited to their object motion characteristics and system update scheme. If long-term prediction is required and the object paths cannot be represented with a linear state space (as in the KF), then more advanced non-linear filters may also be used. The Extended Kalman Filter (EKF) and Unscented Kalman Filter (UKF) [148] can be applied to objects with complex paths, such as sinusoidal or circular paths [149]. However, it should be noted that these filters are considered to be more difficult to implement and tune, and they may still be unreliable for some non-linear paths. Another option exists in the Particle Filter (PF), which is a model estimation technique based on simulation [150]. Results from the PF can be very accurate, approaching the Bayesian optimal estimate [151], but in practice it can be complex and computationally costly to implement in real-time. However, all of the above methods can be used within the proposed framework, depending on the designer s choice; the use of a KF is mainly for simplicity. As an additional note, advanced prediction methods which combine a priori knowledge of the targets behavior and motion can also be used within this framework. For example, human subject motion could be better predicted by a more complete model which combines a simple predictive filter for short-term prediction with long term psychological and behavioral models. In general, as long as any prediction operations are computationally bounded in complexity, almost any form of motion prediction can be used within this framework. Form Prediction The second task of the prediction agent is to predict the future form of the OoI. While analogous to the problem of object pose prediction, due to the use of individually-tracked feature points, the system often possesses contextual information about the object s current action. This information can be used to significantly improve the form estimate at little extra computational cost. To begin, each feature point for the OoI, in world coordinates, is tracked and predicted using the selected predictive filter method, yielding a predicted location. If no action is being positively recognized by the system,

the final predicted feature-point location is simply the predictive-filter output, $\hat{p}^{\,KF}_i$. Otherwise, this predicted location will be fused with a secondary prediction based on the current OoI action:

$\hat{p}_i = w_1\, \hat{p}^{\,KF}_i + w_2\, \hat{p}^{\,act}_i$   (3.22)

$w_1 = \dfrac{\sigma_{act}}{\sigma_{KF} + \sigma_{act}}$   (3.23)

$w_2 = \dfrac{\sigma_{KF}}{\sigma_{KF} + \sigma_{act}}$   (3.24)

In Equation (3.22), $\hat{p}^{\,act}_i$ is the predicted feature-point location as given by the action library, and $\hat{p}_i$ is the final, fused prediction. The weights $w_1$ and $w_2$ are given by Equations (3.23) and (3.24), wherein they are inversely proportional to the uncertainties in the two predictions, $\sigma_{KF}$ and $\sigma_{act}$. To calculate the predicted feature point from the current OoI action, $\hat{p}^{\,act}_i$, a request is sent to Stage L9 containing the time displacement into the future necessary to predict the OoI form. An interpolation method is used (described in the Stage L9 description) to produce the estimate, which is sent along a feedback path back to this stage. Although this stage could directly calculate the interpolation itself, given a copy of the action library and knowledge of the current action, this method allows Stage L10 to incorporate two additional measurements into the prediction, potentially reducing its uncertainty. The outputs of this stage, the fused predicted OoI form and all predicted object poses, are passed to Stage L8 to begin the optimization process.

Stage L8 Central Planning Agent

The Central Planning Agent is the core of the proposed framework, as it implements the novel sensing-system reconfiguration optimization developed in Chapter 2. In order to perform this optimization, the agent must be able to evaluate the expected visibility at any given sensor pose within a sensor's feasible range. This requires multiple past, current, and predicted inputs, which have been accumulated in the pipeline up to this point. The exact method of evaluating visibility will strongly depend on the particular system setup, so it is not possible to specify a completely general-purpose method. Chapters 5 and 6 will outline in detail a reference implementation which uses a combination of GPU processing and other methods to quickly evaluate the visibility metric for multiple potential sensor poses. This method is adaptable to a wide range of environments, but is best suited to human-action sensing and the specific visibility metric developed for the experimental setup. In general, the visibility evaluation for any implementation of the Central Planning Agent must be highly efficient: it may be called hundreds or thousands of times for a given optimization, making it a key determinant in the maximum possible performance of the overall pipeline.
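As a concrete illustration of the Stage L7 outputs consumed by this optimization, the constant-acceleration prediction of Equation (3.21) and the fusion of Equations (3.22)-(3.24) can be sketched as follows. The process-noise model, uncertainty values, and function names are assumptions; the measurement-update step of the filter is omitted.

```python
import numpy as np

def predict_constant_acceleration(x, P, dt, q=1e-2):
    """One prediction step of a constant-acceleration Kalman filter, using the
    second-order state of Eq. (3.21): x = [p(3), v(3), a(3)].  Minimal sketch;
    the measurement update is omitted and the process-noise model is assumed."""
    F1 = np.array([[1.0, dt, 0.5 * dt * dt],
                   [0.0, 1.0, dt],
                   [0.0, 0.0, 1.0]])
    F = np.kron(F1, np.eye(3))                # acts block-wise on [p, v, a]
    x_pred = F @ x
    P_pred = F @ P @ F.T + q * np.eye(9)      # simple additive process noise
    return x_pred, P_pred

def fuse_predictions(p_kf, sigma_kf, p_act, sigma_act):
    """Inverse-uncertainty fusion of the filter- and action-based predictions,
    Eqs. (3.22)-(3.24)."""
    w1 = sigma_act / (sigma_kf + sigma_act)
    w2 = sigma_kf / (sigma_kf + sigma_act)
    return w1 * np.asarray(p_kf) + w2 * np.asarray(p_act)

# Example: predict one demand instant (dt = 0.1 s) ahead, then fuse with an
# action-library estimate of the same feature point (all values are illustrative).
x0 = np.zeros(9)
x0[0:3] = [0.10, 1.40, 2.00]     # position
x0[3:6] = [0.20, 0.00, 0.00]     # velocity
x1, _ = predict_constant_acceleration(x0, np.eye(9) * 0.01, dt=0.1)
p_fused = fuse_predictions(x1[0:3], sigma_kf=0.02, p_act=[0.13, 1.40, 2.00], sigma_act=0.01)
```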

100 75 Similarly, Chapter 6 will outline in detail the reference implementation chosen to perform the visibility optimization itself the Flexible Tolerance Method [152]. This method was chosen primarily for simplicity of implementation, although there are other factors to consider when selecting a method of optimization. The overall goal for this stage is to find the global optimum for the reduced reconfiguration optimization problem presented in Chapter 2. This global optimum for the reduced problem may not necessarily be the global optimum for the most general reconfiguration problem. The chosen method should ideally be robust to local minimums, and should guarantee convergence if possible. More importantly, visibility evaluation is a computationally costly operation; the optimization should evaluate as few configurations as possible. Any modern algorithm which satisfies these parameters can be used in place of the Flexible Tolerance Method. As part of the optimization process, the agent must maintain a cache of all potential sensor pose configurations examined for the past three instants (due to the pipeline depth difference between Stages L8 and L10). A coarse grid which spans the achievable pose space is also defined, with visibility being evaluated at each potential configuration in this grid. The choice of grid spacing depends exponentially on the number of dof in the sensor system. The number of visibility evaluations added should be kept to a minimum, typically less than 5% of those used in the optimization process. During normal system operation, this grid should be accessed by Stage L10 to determine alternate poses only rarely. However, some a priori knowledge of system layout can be used to increase the chance of the grid containing useful pose configurations. A simple scheme is to select pose sets starting from the offline poses for the system and working outwards towards the sensor limits; the selected grid does not necessarily need to span all of the pose space. All cached configurations must be stored in an easily searchable form. Stage L10 will occasionally request a search of these cached configurations, and any matching configurations are forwarded, along with visibility evaluations, directly to Stage L10. These secondary configurations will later be combined in Stage L10 as a last-chance measure to find a pose solution when the primary optimization solution does not satisfy global rules. Overall, however, this process should not occur regularly most desirable functionality should be implemented elsewhere in the system, directly, if possible Stage L9 Action Recognition Agent The action recognition agent implements the vision payload of the sensing system. As described in Chapter 2, the subject form is represented by a feature vector, and actions are represented by a sequence of these vectors, stored as an action feature vector. As part of the off-line calibration of the system, it is assumed that an action library for the subject will be captured and stored. This library is also encoded as a searchable data structure containing action feature vectors representing the average

101 76 case for each action. A detailed discussion of action capture and the action library will be presented in Chapter 5. It is also assumed that, being an average case representation, actions in the library are stored at their nominal length. As discussed in the Chapter 2, real-world objects often vary in their performance of actions. This includes variance in the action feature vector itself, and in the length of the action. The core operation of this stage is governed by a state machine, as shown in Figure 3.5. FIGURE 3.5 ACTION RECOGNITION STATE MACHINE The Action Recognition Agent begins operation in a default state, with no action positively recognized. When in this state, the agent performs a continuous search based on the incoming OoI form estimate determined by Stage L6. The stored action feature vector representation is given by a matrix of key-point form feature vectors:,, (3.25),, In Equation (3.25), the action feature vector for the action,, is given by an matrix, where is the length of the action (number of key feature vectors), and is the length of individual key form feature vectors. These form feature vectors, as described in Chapter 2, are selected as part of action library construction to be representative of key points of variance in the action. They must also implement sufficient sampling so as to prevent aliasing. Given a library of action feature

vectors, the agent first searches all actions to find the two closest form feature vectors to the current estimated OoI form for each action. The Euclidean distance metric is given by:

$d_{a,k} = \sum_{l=1}^{m} \left( A_a(k,l) - f_l \right)^2$   (3.26)

In Equation (3.26), the distance metric, $d_{a,k}$, for the $k$-th form feature vector of any library action $a$ is given by the sum of squared differences between the action-library feature-vector element, $A_a(k,l)$, and the corresponding estimated OoI form feature-vector element, $f_l$. The two adjacent form feature vectors from each action with the lowest average Euclidean distance to the estimated OoI feature vector, with distances $d_{a,k}$ and $d_{a,k+1}$ respectively, are recorded. For each action, if this combined distance, $d_{a,k} + d_{a,k+1}$, is lower than a user-defined threshold, a positive action start is detected, and the agent transitions to the start detected state. Once a potential action start is detected, the agent enters the start form detected state. In this state, the agent examines each incoming estimated OoI feature vector to determine whether it continues the same action, and whether a time normalization constant can be calculated. Each incoming estimated form feature vector is compared to library actions using the above Euclidean distance criterion. Three potential state transitions may now occur: (i) false start, (ii) positive normalization match, and (iii) alternate start match. For Option (i), if a second positive match has not been determined after a user-defined time window of demand instants, and no alternate start matches are defined, the agent transitions back to the no action detected state. This represents cases where the system may or may not have detected an action, but in any case was unable to fix a positive match over time to determine a normalization constant. The window length should be selected to be representative of the average action-library feature-vector length. If it is set too short, true actions may be rejected before a normalization constant can be determined. If set too long, the system will waste resources searching for non-existent actions. The third option, Option (iii), represents a second match of an incoming OoI feature vector, only to a different action than the first match. In this case, either action-start match may be incorrect, but it is impossible to tell until more positive matches occur. Thus, any additional matches of this type are added to a list of secondary potential matches. In subsequent instants, if a match is made with any of these actions, or with the primary action, it becomes a positive action match and the system proceeds as normal for Option (ii) above. Each of these secondary actions also has a separate false-start timer, as in Option (i), and it will be removed from the list after a fixed number of instants with no subsequent matches. If the primary action match expires, as in Option (i), the newest secondary match is promoted to primary match. Finally, it may be prudent for the designer to require additional form

103 78 matches before declaring a positive action match if multiple secondary matches are detected. This often occurs for libraries with low inter-action separation. For reference, the experimental setup in Chapter 6 uses a single-instant penalty for each secondary match. As an example, if two potential matches are found (i.e., one primary, one secondary), then any one action must accumulate a total of three OoI feature vector matches (two base, plus one penalty) before being declared a positive match. If three potential matches are found, one primary, two secondary, then any one action must accumulate four matches (two base, two penalty). In this manner, the system attempts to reduce false positive classification at the point it is most likely to occur at the start of an action. Once a significant history of positive form matches is available, the likelihood of a false positive match drops rapidly. Obviously, there is a tradeoff, as it will take longer for initial classification to occur with larger penalties. Finally, for Option (ii), a positive action match is determined. Since not all actions will be performed at the same rate as the reference library action, a time normalization constant must be determined. To begin, the agent defines two form feature vectors, and, corresponding to the start and end feature vectors. These feature vectors are found through linear interpolation of the pairs of action library feature vectors which were found during the previous matching step. The start feature vector is found through interpolation of the first match for the action, and the end through interpolation of the last match. For each key form feature vector which comprises an action, a corresponding time displacement is also recorded in the library. The start of the action is coded with zero displacement, i.e., 0. As such, linear interpolation is also used to determine two time displacements, and, corresponding to the estimated time that the interpolated start and end feature vectors would occur. The system also records two world-time timestamps for these instants, and. The initial action normalization constant is thus determined by: (3.27) In Equation (3.27), the time normalization constant,, is determined by the ratio of the two differences in time. Absolute value is used to allow for actions performed in reverse to be detected as well. After determining the initial constant, the agent will transition to the action update state. In this state, the agent will attempt to maintain the time normalization constant based on new incoming OoI feature vectors. As such, there are three potential transitions for this state as well: (i) action update, (ii) skip update, and (iii) action end. For each incoming OoI form feature vector, matching is performed as in previous states. If a subsequent match for the current action is detected, as in Option (i) above, the action remains

104 79 positively recognized, and the time normalization constant must be updated. Depending on the object being sensed, the action may be performed at a constant rate, or the rate may even vary over the course of a single action. For a constant rate assumption, the current OoI feature vector can be used to calculate a new estimate of (generally using the previous / as the new estimate s / ). This new estimate can then be averaged with the previous value of, allowing the estimate to converge over time. If variable-normalization actions are evident, a sliding-window scheme can be implemented to average a fixed number of previous instants. If no action match can be found, or the match is for a different action, then Option (ii) may occur, where the normalization update is skipped. As in the previous state, if a maximum number of skipped updates occur, the positive action match is dropped and the machine transitions back to the no action detected state, as in option (iii). If one or more matches to a different action were detected prior to this transition, these may be processed immediately by the state machine as a new possible action. It should also be noted that in all of the above states, Euclidian distance can be replaced with any desired advanced metric of comparison. This metric was chosen primarily for simplicity of calculation, but is by no means central to the method. Multiple Actions To recognize multiple simultaneous actions, there are numerous methods which the system designer may implement. As defined in the previous chapter, multiple actions are considered to be additive in their nature. If it is not feasible to represent all combinations of multiple actions directly in the action library, the system can be modified to support action separation. In these cases, sections of the action feature vector are marked as DNC, or do not care, as part of the library capture. This marking, for a given feature vector element, means that the element does not contribute significantly to defining the corresponding action. For example, in a human waving motion, most parts of the body, except the arm waving, would be marked as DNC in the action library. Elements marked DNC are excluded from distance metric calculations. In this manner, the matching process can selectively ignore unrelated object subparts when recognizing an action. In this case, the agent implements a family of the above state machines, one for every simultaneous action which is to be recognized. The state transitions for these machines are performed in-order, to allow for mutual exclusion. Once a positive action match occurs for any state machine, the corresponding action is ignored by all other machines. Separate time normalization constants are maintained for each positively recognized action. Unfortunately, this still does not address actions which are tightly coupled. If two actions mainly use different sub-parts of the object, they can be easily recognized under the scheme above. However,

105 80 if not, the system must implement generic action separation. Each n-tuple pair of actions in the action library must be examined during the matching steps described above, and the relative strengths of the two or more actions recovered. This leads to multiple issues. The first is a combinatorial explosion in the number of action combinations which must be searched. In many cases, simple a priori knowledge about which action combinations are never likely to occur can greatly reduce this effect. However, it may still be more efficient to implement a spectrum of possible actions in the action library than to implement generic separation. Also, for complex action libraries with many actions, there may be multiple combinations of actions which resemble the current subject action. In general, contextual information is needed to separate all cases. This is a simple task for humans, but difficult for automated systems (arguably impossible under deterministic computing models). Advanced action classification methods have been proposed (e.g., [153] and [154]), with varying degrees of success in recognizing multiple simultaneous actions. This framework allows for such methods to be implemented, if necessary, although their development is significantly outside the scope of this work. In many cases, the system designer must verify first if such powerful methods are truly necessary for the object and actions at hand; many action recognition tasks can be completed successfully with a simple action library and representation Stage L10 Referee Agent The referee agent is the final agent before sensor motion is implemented in the environment. It is responsible for a number of tasks, including maintaining and selecting fallback poses, although its primary task is to implement rule set checking on the final pose decision. Rule-Set Checking In many applications, it may be beneficial or even necessary to define additional rules which the final set of poses must satisfy. For example, when multiple actions are to be recognized, a rule can be added to ensure that a minimum number of cameras are assigned to each action. Another example is a rudimentary multiple action sensing rule under such rules, the system will perform reconfiguration only for the primary subject, but the resultant views must also maintain a minimum visibility for a secondary point in the environment. Both of these rules are used in experiments in Chapters 5 and 6. Other examples included cases where sensor capabilities overlap they can be used to prevent redundant pose selection. In general, these rules attempt to implement functionality which either (i) cannot be created directly through modifying the optimization problem, or (ii) would be too costly to directly implement. As will be shown, these rules should generally be selected such that the majority of pose set solutions produced by the system already satisfy them. The rules are intended to reinforce

106 81 a secondary behavior in rare cases where a solution is determined which does not already satisfy the constraints. The reason for this is that the agent s ability to determine new poses which do satisfy the rules is necessarily limited. The optimization process described for Stage L8 is inherently costly. Many rules would necessarily increase the complexity of this optimization by a complexity factor of. This would lead to an unacceptable load imbalance in the pipeline. For this reason, a simplified method of determining alternate pose decisions is implemented. As was mentioned in the description of Stage L8, the agent maintains a cache of all configurations that were evaluated during the optimization process for a given demand instant. It also maintains a coarse grid of evaluations which are distributed throughout the sensor pose space. A forwarding path exists between Stage L8 and Stage L10. If Stage L10 determines that a rule is violated for the current instant s proposed solution, it sends a query (similar to a traditional database search query) to Stage L8. All cached pose solutions are examined, and any which satisfy the rule-based search query are forwarded to Stage L10. If the search cannot be completed, or no suitable configurations exist, the search will time out at Stage L10, and fallback poses will be used. If any suitable pose configurations are found during the search, they are transmitted back to Stage L10. A very simple hill-climbing algorithm [155] is then used to evaluate intermediate configurations until an internal time limit is reached. This hill-climbing method is based on the combined visibility of a given pose set, for cameras: (3.28) In Equation (3.28), is the combined visibility metric for the camera, and is the combined visibility metric for one potential pose set. If there are total potential pose sets which satisfy the rule set, for each iteration of the hill climbing method a new weighted and fused pose set is defined: (3.29) In Equation (3.29) above, is the new, fused pose set, and is the potential pose set determined by the search in Stage L8. is the weighting factor, the fused visibility for the pose set, as determined by Equation (3.28). This fused pose set is visibility-evaluated, and rulechecked. If it satisfies all rules in the rule set, it is added to the list of potential pose sets for the next iteration. In addition, for every iteration, regardless of if a potential pose set is added, the pose set

Iteration continues until (i) there is only one potential pose set remaining, or (ii) an internal time limit is reached. It is important to note that this method is not guaranteed to reach a global, or even local, maximum. It is purely a last-resort measure for the system to provide a pose solution that is better than the default fallback poses. This is why extensive use of rule sets should be avoided; they must be implemented late in the pipeline, and as such do not have sufficient time to perform a complete search. This search may also be used to improve partial results from Stage L8 if the primary optimization process is interrupted before completion.

Fall-back Poses

As mentioned above, if the initial poses selected by Stage L8 fail the global rule check and Stage L10 is unable to find an acceptable secondary solution, or if Stage L8 and Stage L10 cannot find any solution which satisfies all rules, fall-back poses must be used. This stage is responsible for maintaining these fall-back poses. In general, their selection will be system-dependent, although several basic options cover the majority of cases: (i) translational-static, (ii) completely static, (iii) n-instant rollback, (iv) off-line rollback, (v) search pattern, and (vi) best-guess. Options (i) and (ii) cover the case where the cameras are fixed to be partially or completely static for the next instant. Specifically, for Option (i), camera translation is fixed, but rotational dof continue to move. In Option (ii), all dof are fixed. The assumption is that moving sensors blindly may adversely affect future visibility, so it is better to wait for one instant by having the system remain partially or fully static. The danger is that the situation which caused the fall-back poses may not resolve immediately, or at all, necessitating a check to see if the system is stuck. Option (iii) moves the system to poses as close as feasible to those selected for the instant occurring n instants ago. The idea is to roll back a poor pose decision in a previous instant which caused the use of fallback poses. Option (iv) implements a similar process, except cameras return to poses which are as close as possible to the initial camera poses determined through off-line reconfiguration. Option (v) implements an independent search pattern, where the cameras are sequentially swept through their range of motion; this behavior is useful when the system has not positively detected the OoI. Finally, Option (vi) simply moves the cameras to the potential pose set with the highest fused visibility, regardless of other rules or conditions. This is useful for cases where Stage L8 completes operation, but cannot find any pose sets which satisfy the rules in Stage L10. This stage may implement more than one method of selecting fallback poses, depending on the task at hand. Application-specific heuristics can be used to select the best method possible on a per-instant basis.
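As an illustration of the rule-checking and last-resort hill-climbing refinement described above, the sketch below filters cached candidate pose sets against a rule set, repeatedly fuses the survivors weighted by their combined visibility, and falls back to default poses if nothing qualifies before a time budget expires. The function names, the pose-set representation, and the time budget are assumptions for illustration, not the thesis implementation.

import time

def referee_select(candidates, rules, fallback_poses, visibility_fn, time_budget_s=0.05):
    """Stage-L10-style selection sketch (illustrative only).

    candidates     : list of pose sets (each a list of per-camera pose tuples) cached by Stage L8
    rules          : list of predicates, rule(pose_set) -> bool
    fallback_poses : pose set used if no rule-satisfying solution is found
    visibility_fn  : pose_set -> combined visibility (Equation (3.28)-style sum over cameras)
    """
    deadline = time.monotonic() + time_budget_s

    # Keep only cached pose sets that already satisfy every rule.
    pool = [p for p in candidates if all(rule(p) for rule in rules)]
    if not pool:
        return fallback_poses                      # nothing satisfies the rules

    # Hill-climb: fuse the surviving pose sets weighted by visibility, keep the
    # fused set if it also satisfies the rules, and drop the worst candidate.
    while len(pool) > 1 and time.monotonic() < deadline:
        weights = [visibility_fn(p) for p in pool]
        total = sum(weights) or 1.0
        n_cam = len(pool[0])
        fused = [tuple(sum(w * p[c][d] for w, p in zip(weights, pool)) / total
                       for d in range(len(pool[0][c])))
                 for c in range(n_cam)]
        if all(rule(fused) for rule in rules):
            pool.append(fused)
        # Remove the candidate with the lowest combined visibility.
        pool.pop(min(range(len(pool)), key=lambda i: visibility_fn(pool[i])))

    return max(pool, key=visibility_fn)

In practice, the candidate list would be populated from the Stage L8 cache and coarse grid, and visibility_fn would evaluate the combined visibility for the current demand instant.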

Motion Control

Lastly, this stage is responsible for motion control of the physical system. Depending on the hardware in use, it may implement direct motion control, or an external controller may be used. In the latter case, Stage L10 must keep track of the external controller's status and ensure that the feasible ranges calculated by the system are accurate, given the actual hardware's real-world state. This is the output stage of the pipeline; physical motion is output to the environment. The environment, including the sensors, will change in response, which will in turn be quantized at Stage L1, completing the loop.

3.4 Summary

This chapter introduced the proposed customizable sensing-system reconfiguration method, which is designed to address the carefully constructed sensing problem presented in Chapter 2. The proposed framework itself is based on a ten-stage pipeline architecture, with a primary goal of real-time, real-world operation. The use of a pipeline allows the system to achieve a higher average update rate than an equivalent, non-pipelined system. Pipeline theory was presented which explains the selection and characterization of the pipeline stages, the limits on its performance, the structure of a pose update, and the general operation of the pipeline itself. Lastly, the individual stages of the proposed pipeline structure were examined in detail. This included a description of their function, as well as options for customization by a system designer, and reference implementations.

The pipeline begins with Stage L1, the Imaging Agent, which asynchronously captures images from the system's sensors. These images are filtered, corrected, and adjusted to improve their quality and remove distortion. Non-useful sections of the images are removed through interest filtering, and the result is stored. Stage L2 synchronizes the images from all sensors to a single world time using time stamps and an image interpolation method. Stage L3 detects and tracks points of interest within these images in un-distorted pixel coordinates. Stage L4 removes the projection model of the camera and system, transforming pixel coordinates of detected features into normalized camera coordinates. These coordinates form complex constraints in world coordinates. Stage L5 uses these constraints in a calibration-model-based method to solve for points of intersection. These points are the world coordinates of detected features. Form recovery occurs in Stage L6, where feature points are identified and iteratively fitted into an object-form feature vector using a method which combines a priori information, including model constraints, and the current OoI action. The Prediction Agent, Stage L7, estimates future object poses and OoI forms using a predictive filter and knowledge of the current OoI action. The Central Planning Agent, Stage L8, uses all of this information to perform the constrained optimization outlined in Chapter 2.

The result is a potential sensor pose solution for the current demand instant. Stage L9 implements the vision payload, recognizing actions from the stream of OoI forms estimated by Stage L7. Finally, Stage L10 enforces global rules and maintains fallback poses for the system, while also implementing motion control. Together, this proposed method provides a comprehensive implementation strategy which can be adapted to a wide variety of articulated TVG object action-sensing tasks. Future chapters will examine the experiments which led to the creation of this framework, and will verify and characterize its performance in real-world sensing tasks.
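As an illustration of how a pipelined arrangement raises the average update rate, the sketch below wires ten placeholder stage functions together with queues so that several demand instants can be in flight at once. The stage names and the data passed between them are stand-ins for illustration only, not the framework's actual interfaces.

import threading, queue

def make_stage(name, work_fn, q_in, q_out):
    # Each stage runs in its own thread, pulling one demand instant's data,
    # processing it, and passing the result downstream.
    def run():
        while True:
            item = q_in.get()
            if item is None:            # shutdown sentinel
                q_out.put(None)
                return
            q_out.put(work_fn(item))
    return threading.Thread(target=run, name=name, daemon=True)

# Placeholder stage bodies standing in for Stages L1..L10; each just tags the data.
stage_names = ["L1_image", "L2_sync", "L3_track", "L4_deproject", "L5_solve",
               "L6_form", "L7_predict", "L8_plan", "L9_action", "L10_referee"]
queues = [queue.Queue() for _ in range(len(stage_names) + 1)]
stages = [make_stage(name, (lambda d, n=name: d + [n]), queues[i], queues[i + 1])
          for i, name in enumerate(stage_names)]

for s in stages:
    s.start()
for instant in range(3):                # three demand instants enter the pipe
    queues[0].put([f"instant-{instant}"])
queues[0].put(None)

while (result := queues[-1].get()) is not None:
    print(result)                        # each instant has passed through all ten stages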

4. Single-Action Sensing for Human Subjects

The most basic task defined in Chapter 2 for a time-varying-geometry (TVG) object action-sensing system is the recognition of a single object-of-interest (OoI) action. This Chapter will present the details of a method developed for recognizing a single TVG object action in real-world environments. This method is a precursor to the formal, real-time framework developed in Chapter 3. Simulated and real-world experiments are used to characterize the formal problem definition and identify areas for improvement in the methodology. The issues raised are later used as part of an iterative re-design process to develop the final, formal methodology in Chapters 5 and 6. This Chapter will begin with an explanation of the initial sensing methodology in Section (4.1), which is based on past work. Novel changes to the methodology and the initial problem assumptions are verified through a custom simulation environment, developed in Section (4.2), and simulated experiments in Section (4.3). Results from these simulations are used to develop real-world, quasi-static experiments in Section (4.4). These experiments identify valuable additional information about the effect of obstacles, path prediction, OoI actions, self-occlusion, and other factors on the action-sensing problem, leading to an iterative improvement in the methodology in Chapter 5.

4.1 Initial Single-Action Sensing Methodology

The detailed problem formulation presented in Chapter 2 is the result of significant literature review and feasibility study. However, this problem covers a wide range of subject areas, increasing the potential for design error if a complete framework had been created directly, with no further study. To prevent such errors, a reduced problem formulation was devised: single-action recognition for a TVG object in simulated and real-world environments. This problem relaxes several of the requirements of the more rigorous, general TVG action-sensing problem. The object is restricted to a single action, eliminating the need for complex schemes to address multiple, simultaneous actions. As part of rigorous testing and characterization of the problem space, a range of obstacles and other real-world difficulties were tested. Also, the requirement for real-time operation was relaxed; while this system was designed to run in real-time, if possible, it did not employ a formal real-time framework. Instead, the system was designed around a previous architecture for fixed-geometry action sensing, the Central Planner Architecture.

Central Planner Architecture

This architecture, the Central Planner Architecture, is an agent-based architecture which uses a set of agents with distinct functions to achieve the sensing goal [156]. Agents communicate amongst themselves, and with a Central Planning Agent (CPA), to form aggregate behavior which completes the sensing task.

The CPA is the core of this method, and is directly responsible for the sensor-planning task and for moderating the communication between the other agents. While this structure can potentially operate in real-time, especially if individual components are designed with real-time operation in mind, a framework based on this method would have several difficulties to overcome. Firstly, the intercommunication process is not formally bounded; with no direct control from the CPA, agents may communicate as often and for as long as desired. It is difficult for the CPA to enforce rigid deadlines on operation, due to the distributed nature of the task. It is also difficult for the system designer to characterize its real-time performance, for the same reasons. These limitations will be explored through experiments in Sections (4.3) and (4.4). An overview of the basic CPA-based architecture used in the simulated and real-world experiments in this Chapter is given in Figure 4.1.

FIGURE 4.1 OVERVIEW OF THE PROPOSED SINGLE-ACTION, CPA-BASED ARCHITECTURE

This architecture is based on past agent-based methods, such as [61] and [90]. Each agent in the architecture has a specific, designated function. However, the workload in the system is purposely imbalanced to assign most of the core operations to the CPA; in this manner, it will have greater control over the timing of the system.

Communications between agents are asynchronous, although the CPA enforces a strict pose-decision deadline at the end of each Demand Instant.

Sensor Agents

The sensor agents operate at the lowest level of the sensing system, and each sensor agent is associated with a camera present in the given physical system. The exact configuration (in terms of number and composition of the sensor set) is determined through an established method of off-line sensing-system reconfiguration, [83]. It is assumed that each camera is reconfigurable in terms of its pose, and that each is limited in capability by its positional and rotational velocity and acceleration:

Δt = t_f − t_0,   (4.1)
L_min ≤ p_f ≤ L_max,   (4.2)
p_min^ach ≤ p_f ≤ p_max^ach,   (4.3)
p_min^ach = f_min(p_0, Δt),   (4.4)
p_max^ach = f_max(p_0, Δt).   (4.5)

The above Equations (4.1) to (4.5) govern sensor poses for two immediately adjacent Demand Instants, the current Demand Instant and the next Demand Instant. Thus, p_0 is the initial pose of the sensor for the current Demand Instant, p_f is the final pose selected by the system for the next Demand Instant, t_0 is the current time, t_f is the final time for the start of the next Demand Instant, Δt is the total time between the Demand Instants, and L_min/L_max are the outer limits of the motion axis. Similarly, p_min^ach and p_max^ach are the minimum and maximum poses achievable, respectively, given the capabilities of the sensor, the current pose, and the time remaining. This definition is similar to the final definition proposed in Chapter 3, although a specific motion model is not yet included, to allow for generality. Instead, Equations (4.4) and (4.5) define an arbitrary function for the mapping; the function will depend on the model of motion being used. A similar set of equations is used to determine the rotational limits in terms of angular velocity and acceleration. The final pose space is discretized into N possible final positions, where:

N = R (Δt − t_oh).   (4.6)

In Equation (4.6), R is the average pose-set evaluation rate of the CPA, and t_oh is a time penalty representing the overhead of the system. This is done to bound computational complexity, allowing the CPA to evaluate the entire pose space when performing the constrained evaluation. A continuous algorithm that is limited in iterative depth is later used to replace this discrete method.
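The sketch below illustrates one way the achievable-pose interval and the pose-space discretization described above could be computed for a single translational axis. It assumes a simple accelerate-then-cruise motion model (the thesis deliberately leaves f_min and f_max general) and uses illustrative numbers rather than the real system's limits or units.

def reachable_range(p0, v0, v_max, a_max, dt, L_min, L_max):
    """Achievable pose interval for one translational axis over time dt,
    assuming a simple bang-bang (accelerate-then-cruise) motion model.
    The thesis leaves f_min / f_max general; this is one possible choice."""
    def max_travel(v_start):
        t_acc = min(dt, max(0.0, (v_max - v_start) / a_max))
        d_acc = v_start * t_acc + 0.5 * a_max * t_acc ** 2
        return d_acc + v_max * (dt - t_acc)
    p_max = min(L_max, p0 + max_travel(max(0.0, v0)))
    p_min = max(L_min, p0 - max_travel(max(0.0, -v0)))
    return p_min, p_max

def discretize_poses(p_min, p_max, eval_rate_hz, dt, t_overhead):
    """Equation (4.6)-style bound: evaluate only as many candidate poses as the
    CPA can score in the time remaining after overhead."""
    n = max(1, int(eval_rate_hz * max(0.0, dt - t_overhead)))
    step = (p_max - p_min) / (n - 1) if n > 1 else 0.0
    return [p_min + i * step for i in range(n)]

# Example with placeholder values: a 500-unit axis, camera at 120 moving at +20 per second.
lo, hi = reachable_range(p0=120.0, v0=20.0, v_max=45.0, a_max=900.0,
                         dt=0.5, L_min=0.0, L_max=500.0)
candidates = discretize_poses(lo, hi, eval_rate_hz=200.0, dt=0.5, t_overhead=0.05)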

For each discretized sensor pose, a visibility metric is evaluated on a per-sensor basis to perform the actual optimization outlined in Chapter 2. Under this simplified problem definition, all known obstacles, and the OoI, are modeled as elliptical cylinders. A clipped projection plane is established, and all objects are projected onto this plane, as shown in Figure 4.2.

FIGURE 4.2 CLIPPED PROJECTION PLANE FOR VISIBILITY METRIC CALCULATION

A sorting algorithm is used by each sensor agent to produce an ordered list of poses, sorted from highest to lowest visibility, which is passed to the central planner. The visibility metric itself will be discussed in detail later in this section.

Pose-Prediction Agent

The Pose-Prediction Agent functions similarly to Stage L7 in the proposed real-time framework found in Chapter 3. However, subject form prediction is performed separately by the Action-Recognition Agent; in this simplified methodology, the Pose-Prediction Agent only predicts the world-coordinate pose of all objects and the OoI. The simulation and experimental implementations use a basic Kalman Filter (KF), as per the reference implementation in Chapter 3. However, due to the simplified, linear motion paths, the initial system uses only first-order state variables (object position and velocity), Equation (4.7). Equation (4.7) uses the same variable-naming scheme as Equation (3.21) in Chapter 3. Later experiments which examine the effect of pose prediction modify this agent to use second-order state variables.

Referee Agent

This agent is functionally similar to Stage L10 in the final customizable framework, details of which are found in Chapter 3. It is designed to ensure that global rules are not violated. These global rules are constraints imposed on the overall system behavior that are not captured directly by the optimization problem, or by the specifications of the other system agents. For the following experiments, these rules are application-specific, and will be discussed prior to the relevant experiment. As an example, a rule is defined during the initial simulated experiments to guarantee the assignment of a minimum number of cameras at each Demand Instant to the surveillance of the OoI. Later real-world experiments expand the required functionality of this agent to include fall-back pose selection and system monitoring. The ability of this agent to act on pose selections which violate global rules, once they are detected, is also improved through these experiments.

Form-Recognition Agent

The form-recognition method implemented by this agent is model-based, and is the precursor to the method developed in Chapter 3. The OoI form is stored as a feature vector derived from geometrical data; the feature vector consists of a list of interest-point locations on the OoI, relative to an origin point on the object. For the experiments that follow, the OoI will be a human, and the reference point will be the center of the head. The system is able to determine the location of this reference point and all other points of interest in world coordinates. Two different methods of feature-point detection and tracking are used in the simulated and real-world experiments. The simulated experiments use a color-segment method, described in detail in [64]. Later real-world experiments use a method similar to the reference method for Stage L3 in Chapter 3, although with color-marker-based point detection and Optical Flow (OF) tracking, [76].
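As a concrete, minimal example of the kind of first-order prediction the Pose-Prediction Agent described earlier in this section is said to use, the sketch below implements a constant-velocity Kalman filter over a 2-D world-coordinate position. The state layout, noise values, and matrix choices are illustrative assumptions, not the thesis parameters or Equation (3.21).

import numpy as np

class ConstantVelocityKF:
    """Minimal first-order (position + velocity) Kalman filter for object-pose
    prediction on the ground plane. State x = [px, py, vx, vy]^T."""
    def __init__(self, dt, q=1.0, r=25.0):
        self.x = np.zeros(4)                       # state estimate
        self.P = np.eye(4) * 1e3                   # state covariance
        self.F = np.array([[1, 0, dt, 0],          # constant-velocity transition
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]], float)
        self.H = np.array([[1, 0, 0, 0],           # position-only observation
                           [0, 1, 0, 0]], float)
        self.Q = np.eye(4) * q                     # process noise (tuning value)
        self.R = np.eye(2) * r                     # measurement noise (tuning value)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                          # predicted position

    def update(self, z):
        y = np.asarray(z, float) - self.H @ self.x          # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)            # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P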

Action-Recognition and Prediction Agent

The approach to action recognition is almost identical to the method detailed in Chapter 3. In order to recognize the current OoI action, the sensing system identifies two distinct target forms, referred to as the start and end frames. A Euclidean distance metric is used to determine which action set the start and end frames belong to. However, initial experiments do not calculate a time-normalization constant, and instead assume actions to be fixed-length and monolithic, for simplicity. The addition of time normalization for variable-length actions is addressed in Section (4.4). This agent is also responsible for generating predictions of the future subject form. Unlike the combined method presented in Chapter 3, this agent can only predict form if an action has already been positively recognized. In this case, the action library is used directly to interpolate a subject form based on the current time, relative to the start of the detected action.

CPA, Optimization, and the Visibility Metric

The only agent not covered in the previous section is the CPA itself, which is the core of this methodology. The CPA accepts the sorted visibility evaluations from each sensor agent, as well as a list of the discretized, feasible, and achievable poses. Using this information, predictions of future object poses, and the predicted OoI form, the agent generates camera assignments and selects final sensor poses for the Demand Instant at hand. A fixed time horizon of three Demand Instants is used for this method, represented by three times: t_1 for the immediate future Demand Instant, and t_2/t_3 for the next two Demand Instants after. From the current time, the system has until the next Demand Instant in the horizon, t_1, to make a final pose decision (t_1 is effectively the end of the current Demand Instant). Then, the system has until t_2 to complete any camera motion specified by the reconfiguration process. However, the system may choose to allow some cameras to move on a longer-term plan, in an attempt to improve future visibility. Cameras which complete movement before t_2 are assigned to the next Demand Instant; cameras which continue to move are unassigned. This division is shown graphically for an example three/one assignment (three cameras assigned, one camera unassigned) in Figure 4.3.
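To illustrate the start/end-frame matching performed by the Action-Recognition and Prediction Agent described above, the following sketch compares a recovered form pair against a fixed-length action library using a Euclidean distance metric, and interpolates a predicted form once an action is recognized. The library layout, names, and linear interpolation are assumptions for illustration only.

import numpy as np

def recognize_action(start_form, end_form, action_library):
    """Match an observed (start, end) form pair against a library of fixed-length
    actions with a Euclidean distance metric.

    start_form, end_form : 1-D feature vectors of the recovered OoI form
    action_library       : dict name -> (library_start_form, library_end_form)
    Returns (best_action_name, distance).
    """
    best_name, best_dist = None, float("inf")
    for name, (lib_start, lib_end) in action_library.items():
        d = (np.linalg.norm(np.asarray(start_form) - np.asarray(lib_start)) +
             np.linalg.norm(np.asarray(end_form) - np.asarray(lib_end)))
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name, best_dist

def predict_form(action_name, t_since_start, action_duration, action_library):
    """Once an action is recognized, interpolate the expected OoI form linearly
    between that action's start and end frames (fixed-length, monolithic actions)."""
    lib_start, lib_end = (np.asarray(v) for v in action_library[action_name])
    alpha = min(1.0, max(0.0, t_since_start / action_duration))
    return (1.0 - alpha) * lib_start + alpha * lib_end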

FIGURE 4.3 OVERVIEW OF SENSOR ASSIGNMENT FOR INITIAL METHODOLOGY

This method approximates the more complete optimization specified for the real-time framework. However, by fixing the prediction horizon, the system is inherently limited in its ability to form long-term plans to address large obstacles or significant occlusion. The limitations of this fixed-horizon method are explored in the simulated and real-world experiments that follow. A simple set of rules is used to select a subset of the cameras to service the OoI at the next Demand Instant:

- Cameras with a maximum achievable visibility metric less than a specified minimum, V_min, are unassigned. These cameras are expected to contribute the least to the sensing task at the current Demand Instant, and should be the best candidates for future gains in visibility through continued motion.
- Of the remaining cameras, a maximum of N_max cameras are assigned (those with the highest achievable visibilities), and all others are unassigned. This is to ensure that the system will attempt to plan for future visibility. In some cases, N_max can be set to allow all cameras to be assigned to the next Demand Instant.
- Sensor agents for unassigned cameras are asked to re-evaluate the visibility metric for additional Demand Instants. They are moved in anticipation of potentially optimal viewpoints at Demand Instants farther into the future. They may still participate in the sensing task at any time.
- For assigned cameras, a weighted sum of metrics is evaluated. This sum includes the base object visibility and a measure of the importance of the view (in terms of unique data about the object). Namely, feedback from the form-prediction agent about which sub-parts of the OoI are not currently well represented in the dataset is also included. This metric was added in response to real-world experiments in Section (4.4).
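A minimal sketch of the assignment rules listed above is given below; the threshold V_min, the cap N_max, and the data structures are illustrative assumptions rather than the thesis implementation.

def assign_cameras(achievable_vis, v_min=0.2, n_max=3):
    """Sketch of the camera-assignment rules described above.
    achievable_vis : dict camera_id -> best achievable visibility for the next
                     Demand Instant (values and thresholds here are illustrative).
    Returns (assigned_ids, unassigned_ids)."""
    # Rule 1: cameras that cannot reach the minimum visibility are unassigned.
    eligible = {c: v for c, v in achievable_vis.items() if v >= v_min}
    # Rule 2: of the rest, keep at most n_max cameras with the highest visibility.
    ranked = sorted(eligible, key=eligible.get, reverse=True)
    assigned = ranked[:n_max]
    unassigned = [c for c in achievable_vis if c not in assigned]
    return assigned, unassigned

# Example with four cameras; camera 2 falls below the threshold and is unassigned.
print(assign_cameras({0: 0.81, 1: 0.64, 2: 0.12, 3: 0.47}))

Unassigned cameras would then be handed back to their sensor agents for multi-instant re-evaluation, as described in the rules above.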

Initial simulations further restricted this process to a fixed assignment of three cameras to the immediate future Demand Instant, and one camera unassigned, as shown in the example figure. The real-world experiments use the complete method above, with N_max = 3. The referee agent is also used in this case to implement and enforce a rule that at least two cameras must be assigned to the current Demand Instant.

As mentioned above, the choice of visibility metric is central to the complete pose-selection process. The metric chosen for single-action recognition combines previous knowledge of fixed-geometry objects with the issues identified in Chapter 2. The goal is to create a visibility metric which satisfies the condition put forth in Chapter 2: increasing visibility must never decrease performance in the sensing task. If this condition is satisfied, the system may directly optimize visibility, rather than performance. To begin, objects in the environment are modeled as elliptical cylinders. The clipped projection plane, shown in Figure 4.2, is used with this object model as part of the visibility-metric calculation. The base metric, V, is defined as follows:

V = w_d V_dist + w_θ V_ang + w_a V_area,   (4.8)
θ = cos^{-1}( ((c − f) · (r − f)) / (‖c − f‖ ‖r − f‖) ),   (4.9)
θ_max = max(θ_pan^max − θ_pan^min, θ_tilt^max − θ_tilt^min).   (4.10)

In Equation (4.8) above, three distinct visibility sub-metrics, distance, angle, and visible area, are combined using an application-based weighting scheme, determined by the constants w_d, w_θ, and w_a, respectively. Each of these sub-metrics attempts to address specific qualities that the sensing method expects from useful views of the OoI. The distance metric prefers views with maximum detail, where the camera is physically close to the OoI. The angle metric prefers views with the OoI centered in the view. Finally, the area sub-metric prefers views which provide un-occluded visibility of the OoI. These sub-metrics correspond to the basic qualities of a useful view for TVG action sensing identified in Chapter 2. The area sub-metric defines two tangent points to the bounding ellipse, P_1 and P_2, which each lie on a separate ray intersecting the focal point of the camera. The line between these two points is divided into segments by projecting any occluding obstacles onto the line, as shown in Figure 4.4.

FIGURE 4.4 UN-OCCLUDED AREA CALCULATION FOR INITIAL VISIBILITY METRIC
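A minimal sketch of a per-camera visibility evaluation in the spirit of the weighted combination of Equation (4.8) is given below; the weight values, normalization inputs, and geometric helpers are illustrative assumptions only.

import math

def visibility(cam_focal, cam_rot_center, ooi_com, visible_fraction,
               d_max, ang_max, w_d=0.3, w_ang=0.3, w_area=0.4):
    """Weighted single-camera visibility: distance, viewing-angle, and
    un-occluded-area sub-metrics combined with application-chosen weights
    (values here are placeholders)."""
    # Distance sub-metric: prefer cameras physically close to the OoI.
    d = math.dist(cam_focal, ooi_com)
    v_dist = 1.0 - min(1.0, d / d_max)

    # Angle sub-metric: prefer the OoI near the camera's view centre line.
    to_ooi = [o - f for o, f in zip(ooi_com, cam_focal)]
    view_axis = [f - r for f, r in zip(cam_focal, cam_rot_center)]
    dot = sum(a * b for a, b in zip(to_ooi, view_axis))
    norm = math.hypot(*to_ooi) * math.hypot(*view_axis)
    ang = math.acos(max(-1.0, min(1.0, dot / norm))) if norm > 0 else math.pi
    v_ang = 1.0 - min(1.0, ang / ang_max)

    # Area sub-metric: fraction of the OoI's bounding projection left un-occluded.
    v_area = max(0.0, min(1.0, visible_fraction))

    return w_d * v_dist + w_ang * v_ang + w_area * v_area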

For these segments, the i-th segment has length l_i, so the area sub-metric becomes the fraction of the line from P_1 to P_2 which is visible for a given pose. The effect of fore-shortening is purposely not removed from this calculation. The distance sub-metric is given by the straight-line distance from the camera focal point, f, to the object center of mass (CoM), c. It is normalized by the maximum possible length of this line, d_max, which must be calculated on a per-instant basis. Finally, the angle sub-metric is found using Equation (4.9), which calculates the angle between the line containing the camera focal point, f, and the object CoM, c, and the camera view center line, which contains the camera focal point and the camera center of rotation, r. This is normalized by Equation (4.10), which determines the maximum angular difference given the upper and lower limits of the camera pan and tilt angles, θ_pan^min / θ_pan^max and θ_tilt^min / θ_tilt^max, respectively. This baseline formulation of the visibility metric will be evaluated and characterized to determine areas for improvement. To do so, an environment in which one can rapidly implement and evaluate new ideas is necessary; all assumptions about this visibility metric and its use in the optimization process must be rigorously tested and verified in a controlled setting.
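The area sub-metric reduces to an interval computation along the P_1 to P_2 tangent line; the helper below sketches that calculation under the assumption that occluder shadows are supplied as (start, end) intervals measured from P_1. It is an illustration only, not the thesis implementation.

def unoccluded_fraction(line_length, occluder_intervals):
    """Fraction of the OoI tangent line (from P_1 to P_2, projected into the
    clipped plane) that remains visible, given occluding obstacles projected
    onto the same line as (start, end) intervals measured from P_1."""
    if line_length <= 0:
        return 0.0
    # Clip the occluder intervals to the line and merge overlaps.
    clipped = sorted((max(0.0, s), min(line_length, e))
                     for s, e in occluder_intervals if e > 0 and s < line_length)
    covered, cur_start, cur_end = 0.0, None, None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return 1.0 - covered / line_length

# Example: a 2.0-unit tangent line with two overlapping occluder shadows.
print(unoccluded_fraction(2.0, [(0.2, 0.6), (0.5, 0.9), (1.5, 1.7)]))  # 0.55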

4.2 Simulation Environment

To test and benchmark the initial sensing-system reconfiguration methodology, a virtual environment, or simulator, was created. The use of a simulation environment allows for complete control over critical system variables, including environmental factors such as lighting, camera models, and image quality, and system inputs, such as object poses, camera poses, and OoI form. Complete control over the environment is both a benefit and a drawback; during the early stage of the design process, the problem at hand was not completely characterized. The simulation must be designed to be as complete as possible to accurately reflect the real world. However, to do so, knowledge of the problem at hand is needed, a cyclical requirement. As such, the simulation environment is used in this research as a method of continuous verification; assumptions about the behavior of the system in response to the assumed behavior of the environment are tested. The results are used to verify and improve these assumptions, but the results themselves are always verified by equivalent real-world experiments. The simulation environment was continually updated over the course of this work; any real-world experiment can be accurately replicated in this environment. As a result, a powerful tool is available to the system designer, and to the system itself; environmental simulation forms the basis for calculating visible OoI area in later Chapters.

Object Modeling

Objects in the simulated environment are modeled using the surface-mesh representation described in Chapter 2, Section (2.1.1). This representation is inherently suitable for most computer-graphics engines, which are vertex-based. The actual modeling process, including the associated software, is omitted for brevity. Models are chosen to accurately reflect real-world material properties, such as reflectivity, whenever possible. All models are textured and lit using captured real-world textures and environmental lighting models. Some sample models are shown below, in Figure 4.5.

FIGURE 4.5 SAMPLE SIMULATED OBJECT MODELS

While it is impossible to determine the exact effect of each object on the simulation's accuracy, in general, objects near sensors require more detailed representations than far objects. A three-tier scheme is used for efficiency. Near objects are represented by a high-tessellation model with full surface modeling. Mid-range objects use a low-tessellation model with basic lighting. Far objects use a simple bill-boarding technique. Objects are categorized automatically per-frame using a sorted depth system, a common method in computer graphics [157].

Sensor Modeling

The physical cameras in a sensing system also significantly impact the raw image available to the system. Although the simulated environment has the ability to define arbitrary projections and virtual cameras, it is useful to model real-world cameras as well. The chosen real-world camera-calibration method combines a camera lens-distortion model, [140], with traditional projection matrices, [141]. The projection matrices can typically be implemented directly in the graphics sub-system, but the lens-distortion model cannot. To implement the lens surface, real-world distortion constants are used to form a virtual lens surface. The scene is first rendered to an off-line texture in memory, and this texture is used to texture-map the virtual lens surface. The result, depending on tessellation, accurately reflects real-world camera images. All virtual system cameras use real-world calibration data captured from the actual system cameras.

Target Model

The final component of the simulation environment is the OoI model itself. Although a complex appearance-based model can be used, the following experiments use a model based on geometric primitives for simplicity. This model implements the bounding-cylinder method for the visibility metric's OoI visible-area calculation directly, eliminating the need for a separate fitting and modeling step. Since the OoI for all experiments is a human subject, a 9-joint, 14-dof articulated model (similar to [95] and [45]) was chosen as the system OoI model. This model can accurately represent a significant number of human actions, but uses a minimum number of dof. These choices do not preclude the use of other human models or other TVG objects, so long as a suitable articulated model is available. The model's structure, plus real-world and simulated human figures, are shown in Figure 4.6.
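The lens-distortion step described above can be pictured as warping a regular grid of normalized image coordinates with the standard radial/tangential (Brown) model and using the result to texture-map the virtual lens surface. The sketch below builds such a grid with placeholder distortion constants rather than the calibrated values used for the real cameras.

import numpy as np

def distort_grid(k1, k2, p1, p2, n=32):
    """Build an n x n grid of distorted texture coordinates over the normalized
    image plane using the standard radial/tangential lens model. In a simulator,
    such a grid could drive the texture-mapping of the virtual lens surface.
    Distortion constants here are placeholders, not calibrated values."""
    xs = np.linspace(-1.0, 1.0, n)
    x, y = np.meshgrid(xs, xs)
    r2 = x**2 + y**2
    radial = 1.0 + k1 * r2 + k2 * r2**2
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x**2)
    y_d = y * radial + p1 * (r2 + 2.0 * y**2) + 2.0 * p2 * x * y
    return np.stack([x_d, y_d], axis=-1)          # (n, n, 2) distorted coordinates

grid = distort_grid(k1=-0.12, k2=0.03, p1=0.001, p2=-0.0005)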

121 96 FIGURE 4.6 (LEFT) SKELETAL MODEL (MIDDLE) HUMAN FIGURE (RIGHT) SIMULATED FIGURE 4.3 Simulated Experiments The goal of the first simulated experiments is to verify basic assumptions about the visibility metric, and its effect on performance. In doing so, the experiments will also confirm that using sensingsystem reconfiguration does indeed improve performance over static cameras. The first experiment will examine the direct effect of varying levels of sensing-reconfiguration capability on a single human walking motion. A second family of experiments will examine the effect of pose prediction on the basic operation of the active-vision system. Additional experiments will explore TVG-specific issues and their effect on the visibility metric, such as the effect of self-occlusion. Results from these experiments are incorporated into the initial methodology for iterative improvement Experimental Set-up The simulation environment is designed to accurately reflect the physical active-vision system which will eventually be used for real-world experiments. As such, all object models are based on corresponding real-world objects. The layout of these objects is also determined by the physical system, including initial sensor poses (determined through an off-line reconfiguration algorithm). An overview of the simulated environment is shown in Figure 4.7.

122 97 FIGURE 4.7 OVERVIEW OF SIMULATED ACTIVE-VISION ENVIRONMENT The sensing system is comprised of four Sensor Agents, with associated physical sensors. The virtual cameras associated with these agents are based on real-world cameras from the Logitech QuickCam Pro 9000 series. Four physical cameras were calibrated, and the results were used to build corresponding virtual cameras. All other aspects of the virtual system and environment correspond to the physical environment in later experiments: The environment has dimensions of All objects which fall within this area in the real-world environment are modeled using high-tessellation textured models, as described in the previous section. Distant objects use low-tessellation models or billboards.

123 98 The cameras have a 45 field-of-view, approximately. Cameras are positioned such that their focal point is on their central axis of rotation. Virtual cameras capture pixel images in raw RGB pixel format. All four cameras have a rotational dof with 180 of travel. Two of these cameras have an additional translational dof with 500 of travel. The simulator can simulate real-world rotational and translational capabilities for all cameras. It is also able to simulate ideal reconfiguration, where sensors can be instantaneously moved to any desired feasible location. Specific obstacles in the environment are modeled as opaque cylinders of varying size. These obstacles can be positioned and moved in any desired manner. The real-world setup was created to be approximately 1:6 scale. The simulation method maintains this scaling to allow results to be compared directly. Experiment 1 The first experiment is designed to verify that (i) the use of sensing-system reconfiguration can improve sensing-task performance over the use of static cameras, and (ii) the capability of the system to reconfigure itself is correlated to sensing-task performance. For this experiment, ideal object pose prediction is used (simulating a perfect prediction method). No feedback about the current OoI form or action, including action predictions, is used in the optimization. The environment consists of the basic cluttered environment, plus four 50 diameter, cylindrical, static obstacles. The OoI, a human analogue model, moves through the center of the workspace on a linear path at a constant velocity of 100 /. During this motion, the human continuously performs a single walking action. The locations of the obstacles and the OoI path are shown in Figure 4.8.

124 99 FIGURE 4.8 OBSTACLE LOCATIONS AND OOI PATH FOR SIMULATED EXPERIMENT 1 The trial itself compares three levels of system reconfiguration ability on the basic sensing problem. These levels are defined as follows: Static Cameras All cameras in the sensing system remain at the initial poses, selected through offline reconfiguration, for the entire trial. These poses are selected to provide optimal average-case performance by the off-line algorithm, and as such, alternate static poses are not examined. This is a baseline trial subsequent trials using a reconfigurable system should show a tangible improvement in performance. Velocity-Limited Cameras The second trial of this experiment uses cameras limited to a rotational velocity of 0.35 /, a rotational acceleration of 0.70 /, a translational velocity of 45 /, and a translational acceleration of 900 /. This is to simulate a real-world system which is limited in reconfiguration ability.

Ideal Cameras

The final trial of this experiment uses ideal cameras which are unconstrained in their reconfiguration ability. These cameras may select poses anywhere within their feasible range at any time. This trial examines the upper limit on the performance gain from using active vision. All trials with dynamic cameras use the same initial poses as the static system.

Sensing-task performance is evaluated by the total, absolute error in the recovered subject form. A measure of this error is defined as an error metric, e, for each Demand Instant in the simulation:

e = (1/N) [ Σ_{i=1..N_a} ( |f_i − l_i| / |l_i| ) + N_u + N_m ].   (4.11)

In Equation (4.11), N is the length of the feature vector, N_a is the number of dimensions positively detected and assigned by form recognition, N_u is the number that are detected but not assigned, and N_m is the number missing (not detected). The sum finds the absolute difference between each assigned feature-vector dimension, f_i, and each corresponding library value, l_i. Note that these are dimensions, not feature points, so a feature vector consisting of world-coordinate locations for OoI feature points would have three dimensions for each feature point. The error metric is essentially the percentage of the OoI form which has not been correctly recovered. Missing and unassigned points are considered 100% erroneous, while detected feature dimensions have their true error calculated as a percentage difference from the correct value. Thus, poor estimations may produce more than 100% error; misleading information is considered worse than no information in this sensing task.

The following experiments establish an upper limit value for the error metric. For Demand Instants with a calculated error-metric value greater than 0.25, the error in the estimated form is considered too great for the system to positively recognize the form. Although the chosen action-recognition method may be robust to one or more such Demand Instants, it is desirable to minimize the error metric and provide the maximum possible information to the action-recognition method. This, in turn, gives the process the best chance of succeeding. This value was determined through statistical analysis of multiple trial runs; the chosen value was found to result in at least a 95% true-positive form-recognition rate for un-rejected frames.
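A small sketch of an error-metric computation consistent with the description of Equation (4.11) is given below; the data layout and example values are illustrative assumptions.

def form_error(recovered, library, assigned_mask):
    """Per-instant error metric: assigned dimensions contribute their relative
    error against the library value (library entries assumed non-zero), while
    unassigned or missing dimensions each contribute 100% error; the total is
    averaged over the feature-vector length."""
    n = len(library)
    total = 0.0
    for rec, lib, assigned in zip(recovered, library, assigned_mask):
        if assigned:
            total += abs(rec - lib) / abs(lib)   # relative error; may exceed 1.0
        else:
            total += 1.0                         # unassigned or missing dimension
    return total / n

# A Demand Instant is rejected when the metric exceeds the 0.25 threshold.
e = form_error([1.05, 0.0, 2.4], [1.0, 1.0, 2.5], [True, False, True])
recognized = e <= 0.25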

Experiment 2

The second experiment is designed to characterize the effect of real-world pose prediction on the sensing task. In particular, the goal is to (i) verify that the system will work when pose estimation is used, (ii) verify that the system is robust to increasing error in estimated poses, and (iii) characterize system stability and performance with increasing prediction error. All system capabilities are set to those used for the Velocity-Limited Cameras trial in Experiment 1. The system layout is also similar, with identical initial sensor poses and rotational/translational dof. Two obstacles are used instead of four; they may be static or dynamic, depending on the trial. Their static poses and dynamic paths are shown in Figure 4.9.

FIGURE 4.9 OBSTACLE LOCATIONS AND PATHS FOR SIMULATED EXPERIMENT 2

Pose prediction is provided by the Prediction Agent described in Section (4.1). All objects in the environment must be tracked, including the OoI. As such, pose prediction is necessary for multiple objects. These objects may be static or dynamic. Four trials are used to examine the effect of pose prediction:

Static Obstacles, Ideal Prediction - As in Experiment 1, the first trial establishes a baseline for comparison. Ideal prediction is used to predict the OoI path only.

Dynamic Obstacles, Ideal Prediction - This trial establishes the effect of dynamic obstacles on the baseline system performance. Again, ideal prediction is used to predict all object paths.

127 102 Static Obstacles, Real Prediction The first trial is repeated using real-world KF-based prediction of the OoI path. If the system is robust to the error inherent in the pose estimations, the resultant error metric values should be comparable to the first trial, with ideal prediction (i.e. the system should select nearly identical poses, resulting in similar error values). Dynamic Obstacles, Real Prediction The second trial is repeated, again using real-world prediction of all object paths. Results should be comparable to the second trial, ideal prediction of dynamic obstacles, if the system is robust to pose estimation error. For all trials, error metric values are calculated as in Experiment 1, and are subject to the same 0.25 upper limit for recognition Simulation Results Experiment 1 The first experiment consists of 100 Demand Instants of a walking motion, as outlined in Section (4.3.1), Experiment 1. For each Demand Instant, the error metric is calculated from the recovered feature vector. The results for each of the three runs are shown in Figure The upper threshold of recognition is given by the red line on the graph.

128 103 FIGURE 4.10 COMPARISON OF ERROR METRIC OVER 100 SIMULATED DEMAND INSTANTS The results show a clear overall reduction in error values with the use of sensing-system reconfiguration. In particular, it is immediately apparent that many Demand Instants for the static camera sensing trial are above the upper threshold for recognition. Velocity-limited or ideal reconfiguration allows the system to positively recognize the OoI form at all Demand Instants. Static Cameras - For the case of static cameras, performance was poor from Demand Instants 1 to 20, and 80 to 100, since the object is near the limits of the workspace. Demand Instants 32 to 45 showed that an obstacle would also cause most of the object form to be unrecoverable. These results verify the most basic assumptions of the methodology; static cameras are significantly affected by occlusions and viewing angle. If a limited number of cameras are available, there will often be dead zones where there is insufficient sensor coverage. Although these can be eliminated by adding additional sensors, the number required may quickly become impractical for complex environments. Furthermore, even for Demand Instants with relatively un-occluded views of the OoI, such as Demand Instants 21 to 31 and 46 to 79, there is a difference in error metric values between static and dynamic cameras. Even when providing un-occluded views, static sensors may not provide the best possible views for the sensing task.

129 104 Velocity-Limited Cameras - The use of constrained sensing-system reconfiguration lowered overall error, as the majority of frames are now considered to be positively recognized. Significant error still remained at Demand Instants 31 to 41, where the system was unable to move to its best sensor positions due to movement-time constraints. This is the nature of a real-world sensing system. Without infinite motion capability, the system may not be able to achieve the best possible poses, given the current system state. Similarly, without perfect, complete a priori knowledge of object paths, the system may select poses which, while (near) optimal for the next Demand Instant, may worsen the available poses available at later Demand Instants. This effect is evident at Demand Instants 56 to 59. For these Demand Instants, the system chooses the best pose from those achievable, but the previous reconfiguration choices have resulted in a poor set of choices - poses too far from the ideal set for the system to achieve them before several Demand Instants have passed. As a result, the algorithm could not recover some portions of the model for these frames. Ideal Cameras - The ideal run showed strong correlation to the velocity-limited case, when the performance-optimal pose for a given Demand Instant was achievable by the velocity-limited system. For Demand Instants where previous reconfiguration choices or motion time constraints limited the achievable poses to a poor set of alternatives, there is a definite improvement consider Demand Instants 30 to 41 and 56 to 59. However, there were still some cases where sub-sections of the object could not be recognized due to missing data (e.g., Demand Instants 34, 44, 54, and 59), resulting in a higher than average error. Future experiments must confirm if the selected poses are simply the best possible under the given circumstances, or if there are better poses, but the system could not differentiate them. It is useful to consider one frame in the previous experiment to illustrate this point. Sample Demand Instant Let us consider Demand Instant 85 in the previous trial, where a positive form match was determined by the ideal algorithm, but significant error in the recovered model still exists. Examination of the trial shows that this is the result of the system rejecting incomplete data on one or more model subsections and, thus, not recovering that portion of the model. Specifically, as shown in below in Figure 4.11, the left arm of the subject is not visible.

130 105 FIGURE 4.11 CAMERA VIEWS FOR DEMAND INSTANT 85, SIMULATED EXPERIMENT 1 As mentioned in Section (4.1), the proposed initial algorithm combines three metrics to rank its desired poses. A plot of these metrics, plus the weighted combination metric, is shown for Camera 1 in Figure 4.12.

131 106 FIGURE 4.12 VISIBILITY METRIC CALCULATION FOR DEMAND INSTANT 85, EXPERIMENT 1 One can note that if Camera 1 (and Camera 3, as well) were to select a non-optimal pose under the current rules, then, the left arm of the subject would be visible and, thus, it would be recovered. This pose would likely have a translational d value closer to the maximum of 1.6. However, such poses cannot be distinguished from other, less informative poses using any of the current metrics, Figure In this specific case, the current target distance metric would act against such poses. The current metric does achieve its design; the selected poses are occlusion-free, center the target in the image, and are physically close to the OoI. Therefore, this situation indicates that the chosen metric has not completely captured the issue at hand. The addition of a fourth metric, which can differentiate viewpoints that contain unique information from those that do not, could potentially solve this problem. The only way to determine

132 107 (without a priori knowledge of the target s actions) which views are likely to contain unique data is to predict the form of the object at a future Demand Instant. As such, feedback from the form-prediction process will be added to the proposed methodology. To do so, the OoI must no longer be modeled solely by its bounding elliptical cylinder in the visibility metric. The true projected area of the OoI must be used, instead. Furthermore, each visibility calculation must be weighted by the amount of this surface area which is unique among the proposed set of sensor poses. These iterative changes will be discussed in Chapter 5. It should be noted that these Demand Instants make up a small fraction of all Demand Instants; the use of an active-vision system has indeed been confirmed to improve sensing-task performance over a system with static cameras. In addition, this experiment has also confirmed that system reconfiguration ability is positively related to the performance gain from using active vision, as hypothesized. Thus, this experiment has achieved its two principal goals, and identified an area for iterative improvement of the proposed algorithm. Experiment 2 Given that Experiment 1 demonstrated the basic functionality of the proposed methodology, Experiment 2 was performed to characterize the effect of pose prediction on the method. Results for the four trials of this experiment are shown in Figure 4.13, Figure 4.14, Figure 4.15, and Figure 4.19, respectively. Demand Instant spacing has been increased to magnify the effect of prediction; shortterm predictions have a lower cost to the system if they are incorrect.

133 108 FIGURE 4.13 ERROR METRIC PLOT FOR IDEAL PREDICTION, STATIC OBSTACLES Static Obstacles, Ideal Prediction - The first trial presents results which are very similar to those found in Experiment 1. All obstacles in the system are static, and the subject follows the same trajectory and action sequence as before, with ideal prediction. As expected, the Demand Instants which are correctly recognized correspond closely to those from the previous results (considering the 4:1 ratio in time), as does the average value of the error metric, and its standard deviation. As before, there are some Demand Instants, such as Demand Instants 2, 14, and 19, where the form was positively recognized, but significant error remains. These are cases where some portion (an arm or leg, for example) of the model data was rejected by the form-recognition agent due to a poor estimate from the input images. The solution identified in the previous section can address this issue, where feedback from the recognition agent can be used to improve the input data for these unknown regions. All frames in this trial, however, were considered positively recognized, regardless of any missing data. The principal use of this trial is to provide a basis of comparison for the remaining trials.

134 109 FIGURE 4.14 ERROR METRIC PLOT FOR IDEAL PREDICTION, DYNAMIC OBSTACLES Dynamic Obstacles, Ideal Prediction - For this trial, the obstacles have been given a linear trajectory with a constant velocity. Prediction is assumed to be ideal for this trial, and thus a plot of the predicted locations has not been added to the graph. From Figure 4.14, one can see that the overall error metric has been reduced in some regions, notably the points highlighted in the previous trial as containing an unrecognized section of the model. This was due to fortuitous locations of the obstacles and subject at these frames, since the paths chosen for the obstacles are designed to exhibit fewer occlusions in the second half of the sequence. Regardless of this, all frames are still considered positively recognized, and the average error metric is not significantly different from the previous trial. Also, other Demand Instants, such as 6 to 11 and 21 to 24 show close correlation to the previous trial. This indicates that the reconfiguration methodology is able to effectively handle the changing locations of the obstacles in selecting sensor poses.

135 110 FIGURE 4.15 ERROR METRIC PLOT FOR REAL PREDICTION, STATIC OBSTACLES Static Obstacles, Real Prediction This is the first trial that introduces non-ideal prediction using the KF. Input to the filter for OoI pose prediction is taken from the form recognition agent. Input for obstacle position prediction (in the next trial) is taken from the ideal positions given by the simulation environment, plus simulated tracking error. The tracking results for the trial are shown in Figure 4.16, overlaid on the initial obstacle and subject positions. It is clear from the plot that prediction is continuous and closely correlated to the true OoI position.

136 111 FIGURE 4.16 OOI POSE ESTIMATION FOR STATIC OBSTACLE, REAL PREDICTION TRIAL Examining the error metric plot in Figure 4.15, one can see that for this trial, the overall error metric and sequence of recognized frames closely corresponds to that of the first trial. The addition of realworld OoI pose prediction has not significantly altered the poses selected by the system, or the resulting forms being estimated. However, examining the tracking results in Figure 4.16, one can see a slight deviation in the predicted path, and a corresponding increase in the recovered error metric at Demand Instant 17. This is useful for determining the limit of tolerance the system has for error in the input data for pose prediction, and for characterizing its mode of failure. This trial was thus repeated, only with additional Gaussian noise artificially injected into the estimated OoI pose before it is sent to the prediction agent. This noise was found through simple experimentation to be similar to real-world estimation noise. At a critical level of noise, the system fails at this Demand Instant (which already has higher-than-average error, due to the effect of the two static obstacles). Results for this case are shown in the error metric plot, Figure 4.17, and the tracking plot, Figure 4.18.

137 112 FIGURE 4.17 ERROR METRIC PLOT FOR TRACKING FAILURE TRIAL FIGURE 4.18 OOI POSE ESTIMATION FOR TRACKING FAILURE TRIAL

From the paths, one can note that, as a result of a temporary occlusion around Demand Instant 17 and the additive Gaussian noise, a poor prediction of future pose is returned to the system. In turn, the system selects poor poses, which further increase error at the next Demand Instant. Eventually, the system loses the OoI track entirely and becomes unstable. This uncovers a potentially incorrect assumption in the methodology. The work assumed that as long as the target is positively tracked, it is within the workspace, and vice versa; that is, if the system were to lose track of the target, the system assumes that it has left the workspace. However, this is clearly not the case here. Furthermore, the method assumed a relatively accurate a priori estimate of the subject's initial position, which in reality one may not have (it may be inaccurate, or even unavailable). As such, two methodology changes are needed in these cases: (i) the system must be able to initially search for a subject if no estimate is provided, and search to verify any initial estimates of pose that are given; and (ii) if a subject is lost during surveillance, provisions must be added to both detect this case and terminate surveillance, or to re-acquire the subject. For this specific case, a weakness in the prediction-agent implementation itself has also been identified, in that a highly erroneous observation for Demand Instant 17 has affected the state of the predictor such that the subsequent estimates it produces are not reasonable. This feeds back through the system: without reasonable predictions of future positions, the system chooses poses for sensors that may not contain the subject at all. In essence, the control loop is broken, and the system is out of control. Outlier suppression must be added to detect potential outliers at the input to the windowed KF and reduce their weight dynamically, thus also reducing their leverage on the estimated system state.
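One common way to realize the proposed outlier suppression is innovation-based (Mahalanobis) gating, which inflates the measurement covariance for observations that are statistically inconsistent with the predicted state. The sketch below assumes a filter object exposing x, P, H, and R (such as the constant-velocity sketch given earlier) and uses an illustrative chi-square gate; it is not the thesis implementation.

import numpy as np

def gated_update(kf, z, gate=9.21):
    """Innovation-gated measurement update: if the squared Mahalanobis distance
    of the innovation exceeds the gate (9.21 is roughly the chi-square 99% value
    for 2 dof), inflate the measurement covariance so the outlying observation
    has little leverage on the estimated state."""
    y = np.asarray(z, float) - kf.H @ kf.x
    S = kf.H @ kf.P @ kf.H.T + kf.R
    d2 = float(y @ np.linalg.solve(S, y))               # squared Mahalanobis distance
    R_eff = kf.R * (1.0 if d2 <= gate else d2 / gate)   # down-weight outliers
    S = kf.H @ kf.P @ kf.H.T + R_eff
    K = kf.P @ kf.H.T @ np.linalg.inv(S)
    kf.x = kf.x + K @ y
    kf.P = (np.eye(len(kf.x)) - K @ kf.H) @ kf.P
    return d2 > gate                                     # True if treated as an outlier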

139 114 FIGURE 4.19 ERROR METRIC PLOT FOR REAL PREDICTION, DYNAMIC OBSTACLES Dynamic Obstacles, Real Prediction For the final trial, the obstacles have been assigned dynamic paths identical to that of the second trial (dynamic obstacles, ideal prediction). Now, however, prediction of the obstacle poses is not ideal, but instead is performed by the prediction agent. The path for this trial (and by extension, Trial 2) was chosen to highlight some of the issues identified in the previous sections. A plot of the error metric for this trial is shown in Figure As shown in Figure 4.20, the estimated and predicted poses for all three objects are closely correlated.

140 115 FIGURE 4.20 OOI AND OBSTACLE POSE ESTIMATION FOR DYNAMIC OBSTACLE TRIAL This is to be expected, as the previous trial showed a closely-correlated OoI track. In this trial, for roughly the first half of the Demand Instants, the obstacle paths have been designed to produce significant occlusions in Cameras 1 and 3. In previous trials, these frames were relatively unoccluded, exhibiting low error metrics and positive recognition matches. If the reconfiguration capabilities of the system were not beneficial (or were insufficient), then these frames would now experience significant error in the recovered form. As expected, there is a slight increase in the error metric for these Demand Instants, due to the increased occlusion present for these frames. However, all frames in this region are still considered positively recognized, and there is not a significant increase in the overall error. This indicates that the reconfiguration of the system has effectively dealt with the occlusions presented by the mobile obstacles, indicating that performance has indeed been increased over the worst case scenario of no reconfiguration and dynamic obstacles. Thus, it is valid (and useful) to apply reconfiguration here. Secondly, the close correlation seen in the overall error metric graph to the previous trials indicates that the uncertainty introduced by the real-world prediction method (the Kalman Filter) has not significantly influenced the proposed algorithm. Indeed, subject form at all Demand Instants is still recognized, with a low error metric value. By overlaying all graphs onto a single figure, Figure 4.21, one can see that all trials are closely correlated.

141 116 FIGURE 4.21 COMPARISON OF ERROR METRIC PLOTS FOR EXPERIMENT 2 As expected, the system has removed the effect of the static and dynamic obstacles on the form estimation process; the sensing task behaves as if no obstacles are present. Thus, this experiment has (i) confirmed that the system can address static and dynamic obstacles, (ii) verified the proposed methodology is robust to real-world pose prediction uncertainty, and (iii) identified additional areas for improvement for later real-world systems. 4.4 Summary This Chapter has presented a novel single-action sensing-system reconfiguration method designed specifically for sensing TVG objects and their actions. This method is the precursor to the customizable real-time framework found in Chapter 3. Due to its simplicity, it is useful on its own as a simplified methodology for use in low-cost, single-action applications. The proposed methodology is based around a central-planning architecture. This agent-based method uses distinct agents and a central planner to select system sensor poses through aggregate behavior. The individual agents are similar in functionality and division of labor to the later pipeline framework. A simulation environment, based on real-world objects and sensors is developed to test the system. This environment simulates all of the basic factors affecting the sensing process identified in

142 117 Chapter 2, while allowing individual variables to be controlled. Repeatable simulated experiments are proposed to verify the assumptions behind the proposed methodology, and to confirm the claimed increase in sensing-task performance over static cameras. The first experiment successfully verified that both tested forms of system reconfiguration, velocity-limited or ideal, improve sensing-task performance, represented in this case by lower recovered form error. Furthermore, the reconfiguration ability of the system has a direct impact on the gain in performance over static cameras. The experiment also identified that some Demand Instants require feedback from the action recognition process to allow the system to select the best possible poses. The second experiment characterized the effect of real-world pose prediction on the reconfiguration process. It was found that the proposed methodology is robust to the uncertainty inherent in the prediction process. Methods were also identified to allow the system to address instances where pose prediction was erroneous, or the tracking system failed entirely. The system was also able to remove the effect of static and dynamic obstacles on the sensing task. Overall, the novel method of sensing-system reconfiguration was shown to tangibly improve sensing-task performance over static cameras. The potential improvements identified through the experiments will be incrementally incorporated into the methodology as part of the iterative design process. The next chapter, Chapter 5, will use these changes and others to address multiple TVG object actions.

5. Multi-Action Sensing and Multi-Level Action Recognition

As part of the simulated experiments presented in Chapter 4, several potential improvements were identified which could increase the potential benefits of using the proposed sensing-system reconfiguration method. These improvements are most applicable for situations which require the system to sense multiple sequential or simultaneous actions, due to the increased complexity of the problem at hand. In particular, this Chapter will address both simultaneous multi-action sensing and multi-level action sensing. Multi-level actions occur at different scales within the same time-varying-geometry (TVG) object. For example, experiments will later be presented where the system must recognize a full-body human motion (walking) and a small-scale, precision motion (pointing). Multiple simultaneous actions follow the additive scheme defined in Chapter 2, where two or more atomic actions may be additively combined to form a new action. For example, experiments will also be presented where the system recognizes two random simultaneous human actions. Thus, this Chapter will begin by presenting an improved methodology which is specifically designed to sense multiple, potentially simultaneous TVG object actions, at multiple detail levels, in Section (5.1). In Sections (5.1) and (5.2), the proposed methodology will also address real-world operational issues, such as complete system calibration. Finally, real-world experiments will characterize the method's performance when sensing multiple simultaneous actions and multi-level actions, Section (5.3).

5.1 Multi-Action and Multi-Level Recognition Methodology

The basic system structure is identical to the one proposed for single-action sensing in Chapter 4. It is based around a central-planning architecture, where multiple agents communicate with a central-planning agent to achieve the desired aggregate behavior. Multiple areas for improvement were identified in Chapter 4 as well. These changes and others will be examined in detail in this section. In particular, attention will be given to multi-part action sensing and real-world sensing issues. An overview of the agent structure is given below, in Figure 5.1. Although the basic structure remains unchanged, several key agents have been modified to account for the issues identified in Chapter 4. To begin, the sensor agents and central-planning agent are modified to better account for multi-part and multi-level objects through an improved, multi-part visibility metric.

FIGURE 5.1 OVERVIEW OF MULTI-ACTION AND MULTI-LEVEL ACTION METHODOLOGY

Sensor Agents and Central Planning Agent

In the previous Chapter, the base visibility for an object of interest (OoI) was calculated as a weighted sum of three sub-metrics measuring distance to the camera, angle to the camera, and visible area. However, all objects were modeled as simple elliptical cylinders, allowing the visible area to be calculated using a simple straight-line distance, based on the un-occluded arc-length in the 2-D camera plane. While useful for the most basic case of single-action sensing, it was discovered that this representation has two key weaknesses. First, it cannot accurately portray the complexity of a true articulated object; different parts of the volume enclosed by the elliptical bounding cylinder contain different parts of the OoI. True projection of the actual object, rather than a simplified representation, is needed to determine (i) which sub-parts are visible, and (ii) the true surface area of the OoI itself which is visible. Secondly, it does not distinguish which sub-parts of the OoI are most important, given the current state of the action-recognition process. To address these concerns, a modified sensor-agent structure is proposed. As before, each sensor agent is associated with a single camera in the physical system. The sensor agent encapsulates all 6-dof (degree-of-freedom) motion capabilities of the associated camera (positional and rotational velocity/acceleration), and manages its current pose over time.

It ensures that all poses selected for and examined by the sensor agent are both feasible and achievable, given the current system state. The sensor agent is responsible for evaluating a revised visibility metric, to present to the central-planning agent. Herein, the method models all objects (obstacles and the OoI) with accurate volumetric models. A clipped projection plane is established, and the agent projects all objects onto this plane. Base visibility is, thus, calculated in the most general 3-D case as given by Equation (5.1). Variable naming in Equation (5.1) is identical to Equation (4.8); however, the visible-area sub-metric has been updated. When all objects are projected onto the 2-D sensor plane, the visible portions of the OoI form distinct visible regions of pixels. Each of these regions has an area, which is normalized by the total possible projected area. This maximum is given by a second projection of only the OoI onto the same plane as before. Under this scheme, the true visible surface area of the OoI is captured in the core reconfiguration problem. This is essential for sensing complex, articulated objects, such as humans. However, this still does not provide a framework for multi-part objects. As identified previously, some parts of the OoI become relatively more important at different stages of the recognition process. Thus, the proposed metric must be further modified. Visibility, as defined above, is a measure of the amount of raw data on the OoI available to a given sensor, at a given pose. Performance must be monotonically increasing in relation to a sensor's visibility metric. Since the proposed system framework is designed to sense articulated objects, visibility is inherently dependent on (and, ideally, only on) the visibility of each of the sub-parts of the OoI, as in Equation (5.2), where the visibility of the j-th sub-part of an OoI with NP parts is given for each sensor. Each individual sub-part visibility is calculated using Equation (5.1); in this calculation, only the area of a single sub-part is considered at one time. However, this naive representation treats all sub-parts as equally important at all times. As previously identified in Chapter 2, viewpoints exhibit differentiated importance due to, among other factors, individual sub-parts of the OoI being relatively more important when little is currently known about them. Thus, there are cases under this naive form where the function relating visibility to performance could not be monotonically increasing, due to this inherent mis-assumption. As such, visibility is instead defined as a weighted sum of the sub-part visibilities, as given by Equation (5.3) below.
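To make the preceding discussion concrete, a minimal illustrative sketch of the multi-part visibility computation is given here. It assumes that the distance and angle sub-metrics are already normalized, combines them with the projected-area sub-metric in the manner of Equation (5.1), and then weights the per-sub-part visibilities as formalized in Equation (5.3), which follows. The function names, sub-metric weights, and example values are illustrative assumptions, not part of the proposed implementation.

```python
import numpy as np

def subpart_visibility(dist_metric, angle_metric, visible_area, max_area,
                       w_dist=0.3, w_angle=0.3, w_area=0.4):
    """Base visibility of a single OoI sub-part for one sensor (cf. Eq. (5.1)).

    dist_metric and angle_metric are assumed pre-normalized to [0, 1];
    visible_area is the sum of un-occluded projected pixel areas of the
    sub-part, and max_area is its projected area with all occluders removed.
    The sub-metric weights are illustrative placeholders.
    """
    area_metric = 0.0 if max_area <= 0 else min(visible_area / max_area, 1.0)
    return w_dist * dist_metric + w_angle * angle_metric + w_area * area_metric

def multipart_visibility(subpart_vis, part_weights):
    """Weighted multi-part visibility for one sensor (cf. Eq. (5.3)).

    subpart_vis  : NP per-sub-part visibilities, each from Eq. (5.1)
    part_weights : NP weights W_1..W_NP chosen to emphasize the sub-parts
                   that are currently most informative
    """
    subpart_vis = np.asarray(subpart_vis, dtype=float)
    part_weights = np.asarray(part_weights, dtype=float)
    # Normalize by the weight sum so the combined metric stays in [0, 1].
    return float(np.dot(part_weights, subpart_vis) / part_weights.sum())

# Example: a 3-part object where part 2 (e.g., the right arm) is currently
# the least-known sub-part and is therefore weighted most heavily.
vis = [subpart_visibility(0.8, 0.9, 1500.0, 2000.0),
       subpart_visibility(0.8, 0.9, 400.0, 1200.0),
       subpart_visibility(0.7, 0.6, 900.0, 1000.0)]
print(multipart_visibility(vis, part_weights=[1.0, 3.0, 1.0]))
```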

V_s = Σ_{j=1..NP} W_j · V_{s,j}    (5.3)

where V_s is the overall visibility for sensor s and V_{s,j} is the visibility of its j-th sub-part. As part of its implementation, the sensing system must select the weights, W_1 to W_NP, for each instant, such that the resulting viewpoints from the reconfiguration problem maximize the amount of unknown, but useful, information recovered about the OoI. Determining what information is useful depends on the application, which includes the action and form representation, the action library, the object model, and other factors. These weights may be decidedly non-linear in relation to other variables in the system. As such, this application-specific determination cannot be expressed in general form. They can, however, be readily calibrated for a given vision task. The experiments that follow use a long-term intelligent hypothesis tester to select the best average-case values over multiple trials. In order to determine these weights, however, a real-world system must maintain an estimate of the current action of the OoI, as well as make predictions on future forms through this action-recognition process. To do so, the system must first produce an estimate of the feature vector to be classified (form recovery versus action recognition), within the very limited time-span before the next demand instant on the rolling horizon. This problem becomes more vexing in a multi-agent system due to data-coherency problems across different sensor agents. Thus, a novel framework that is capable of recovering the current feature vector by fusing data from multiple sensors, in real-time, and over a communication medium that is prone to communication failures during processing, is necessary.

Distributed Form Recovery

A robust method of recovering the form feature vector is proposed herein. In the agent-based structure above, the recognition of the form and action of the OoI is handled by a specialized agent, but the recovery of the current form feature vector is distributed among the sensor agents and the central-planning agent. This work assumes that a model library, known a priori to all agents, is available. The form-recovery method itself does not need to be specified, aside from it being model-based. The method presented herein details how to achieve real-time, distributed recovery by considering data coherency. The form-recovery method also assumes that each sensor agent maintains a track of every feature point, and as such knows whether each is visible in the sensor's field-of-view. The sensor agents are designed to run asynchronously from the central-planning agent; thus, they capture new frames as fast as possible. The planner, on the other hand, operates to select new poses only at a specific rate, given by the choice of demand instants.

Each sensor agent attempts to maintain an open communication link with every other sensor agent, and with the central-planning agent. The system operates under the following rules:

1. Each sensor agent broadcasts, to all connected agents, the detected locations of any visible points with each new frame (or every n_f frames, to reduce network traffic), along with a timestamp of when that frame was taken.
2. If an agent receives newer estimated pixel coordinates, as seen by another sensor, it re-broadcasts this information to each connected agent, unless that agent is the source of this new estimate, or the message originated from that agent.
3. If an agent receives older estimated pixel coordinates, they are discarded.

For example, in an arbitrary system, let us consider a case where Sensor Agent 1 receives the following data about Point 1 from Sensor Agent 3: Sensor Agent 2 has detected Point 1 at (x_B, y_B) at time t = 0.26 s. If the last known (to Agent 1) copy of this information was Sensor Agent 2, Point 1, (x_A, y_A), t = 0.25 s, then Sensor Agent 1 would re-broadcast this new message to all agents, except to Sensor Agent 2 (the source of the new data) and to Sensor Agent 3 (the originator of the message). In order to facilitate these transactions, the clocks of all agents are assumed to be synchronized to a reasonable accuracy. In this manner, each agent retains the latest detected pixel coordinates of every point, from the perspective of every sensor agent. When a request to the central-planning agent for the current subject feature vector is made, reconstruction starts. As this is a costly operation (transmission time), and there is little benefit to coordinating points between instants, the sensors are not continually coordinated. The goal is to reconstruct the feature vector for the OoI at the exact time the request was made, t_req. The following pseudo-code listing, Table 5.1, provides the basic method to reconstruct this feature vector.

TABLE 5.1 PSEUDO-CODE LISTING FOR DISTRIBUTED FORM RECOVERY PROCESS

For each point, n = 1 to n_points:
    For each sensor, s = 1 to n_sens:
        If the sensor has observations at or spanning t_req: use the observation
        whose timestamp equals t_req for x_p, or an interpolation if none are
        exactly equal.
        Otherwise, do one of the following:
            1. Predict x_p using the KF track, only if the latest observation is
               within t_max of t_req.
            2. Wait for t_wait until new data arrives.
        End Choice.
    End For.
End For.

Above, the timestamp referred to is that of the latest observation of the point by Sensor s, t_max is the maximum time difference over which prediction is allowed, and t_wait is a minimum amount of time to wait.
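The following minimal sketch illustrates the data-coherency rules above and the per-point selection of Table 5.1. The class and method names are hypothetical, the interpolation option is omitted, and the Kalman-filter prediction is reduced to a user-supplied callable; it is a simplified sketch of the mechanism, not the thesis implementation.

```python
import numpy as np

class ObservationStore:
    """Per-agent store of the latest observation of each point by each sensor."""
    def __init__(self):
        self.obs = {}   # (sensor_id, point_id) -> (timestamp, pixel coordinates)

    def update(self, sensor_id, point_id, timestamp, pixel_xy):
        """Keep an incoming observation only if it is newer (Rules 2 and 3)."""
        key = (sensor_id, point_id)
        if key not in self.obs or timestamp > self.obs[key][0]:
            self.obs[key] = (timestamp, np.asarray(pixel_xy, dtype=float))
            return True    # caller should re-broadcast to the other agents
        return False       # older data is discarded, no re-broadcast

    def select_for_request(self, sensor_id, point_id, t_req, t_max,
                           kf_predict=None):
        """Choose pixel coordinates for reconstruction at time t_req
        (simplified version of the Table 5.1 choice)."""
        key = (sensor_id, point_id)
        if key not in self.obs:
            return None
        t_obs, xy = self.obs[key]
        if abs(t_req - t_obs) < 1e-3:            # observation effectively at t_req
            return xy
        if kf_predict is not None and abs(t_req - t_obs) <= t_max:
            return kf_predict(xy, t_obs, t_req)  # predict forward along the track
        return None                              # otherwise: wait for newer data

# Minimal usage with a trivial constant-position "prediction".
store = ObservationStore()
store.update(sensor_id=2, point_id=1, timestamp=0.25, pixel_xy=[310.0, 215.0])
store.update(sensor_id=2, point_id=1, timestamp=0.26, pixel_xy=[312.5, 214.0])
xy = store.select_for_request(2, 1, t_req=0.30, t_max=0.1,
                              kf_predict=lambda xy, t0, t1: xy)
print(xy)
```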

The second choice is deliberately general, as it will depend on the application, and even on the individual request. The system must choose whether to favor accuracy (by waiting for exact data more often) or speed of delivery. Once the values of x_p (pixel coordinates) from each sensor agent have been chosen, the method robustly solves for the 3-D intersection of the lines projected through these 2-D points, given the camera and system models and their calibrated parameters. As a reference method, an iteratively re-weighted least-squares method (robust M-estimation, [158]) is chosen, though simpler methods can be used if the tracking method produces very few outliers and false-positive detections. The choice of an iterative method is necessary if the chosen camera model does not exhibit a closed-form inverse relationship, which is commonly the case with high-order distortion models used in conjunction with high-order translation models.

Feature Point Representation and Tracking

To perform real-world experiments, it is necessary to provide a method for feature-point identification and tracking. In previous simulated experiments, a color-based segmentation method [64] was used, due to the simplified nature of the target. However, color-based markers will not be available for many applications in the real world. As such, the sensor agents are specified to perform tracking using vision-based pixel-tracking methods. Specifically, the system must implement (i) feature detection and (ii) feature tracking. For feature detection, all features must be individually detected upon entering a sensor field of view (FOV), and upon re-acquisition after a tracking loss. A detailed discussion of options for feature detection is presented in Chapter 3 and Appendix D. The general-purpose reference method used in the following experiments is a multi-view Principal Component Analysis (PCA) method. Therein, a direct image search is applied (using PCA as a dimensionality-reduction technique) to search for a combined feature vector which includes image descriptors and the image itself [156]. Feature tracking is provided by a modified Optical Flow (OF) algorithm based on the Lucas-Kanade (LK, [159]) method of estimation. Robust M-estimation [158] is used as part of this reference implementation to perform iterative re-weighting of constraints in the OF process, reducing the effect of outliers on the motion estimate. The detected and tracked pixel coordinates for all feature points are used with the above distributed form-recovery algorithm to produce an estimate of all visible feature-point locations in world coordinates. Again, a detailed discussion of other options for feature-point tracking can be found in Chapter 3 and Appendix E. The above reference implementations are used in all real-world experiments that follow.
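As a simplified illustration of the 3-D intersection step described above, the following sketch performs an iteratively re-weighted least-squares intersection of back-projected rays, using a Huber-style re-weighting in place of the full robust M-estimator of [158]. The ray representation (camera centres and unit directions) and all thresholds are assumptions made for the purpose of the example.

```python
import numpy as np

def robust_ray_intersection(centers, directions, n_iter=10, huber_delta=20.0):
    """Estimate a 3-D point from several back-projected camera rays.

    centers    : (N, 3) camera centres in world coordinates
    directions : (N, 3) ray directions through the detected pixels
    Iteratively re-weighted least squares: each ray's weight is reduced when
    its point-to-ray distance exceeds huber_delta (Huber-style M-estimation).
    """
    centers = np.asarray(centers, dtype=float)
    directions = np.asarray(directions, dtype=float)
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    weights = np.ones(len(centers))
    point = centers.mean(axis=0)                      # initial guess

    for _ in range(n_iter):
        A = np.zeros((3, 3))
        b = np.zeros(3)
        for c, d, w in zip(centers, directions, weights):
            P = np.eye(3) - np.outer(d, d)            # projector orthogonal to the ray
            A += w * P
            b += w * P @ c
        point = np.linalg.solve(A, b)                 # weighted least-squares point
        # Re-weight using the distance from the current point to each ray.
        resid = np.array([np.linalg.norm((np.eye(3) - np.outer(d, d)) @ (point - c))
                          for c, d in zip(centers, directions)])
        weights = np.where(resid <= huber_delta, 1.0,
                           huber_delta / np.maximum(resid, 1e-9))
    return point

# Two rays that should intersect near (100, 50, 0).
pt = robust_ray_intersection(centers=[[0, 0, 0], [200, 0, 0]],
                             directions=[[100, 50, 0], [-100, 50, 0]])
print(pt)
```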

Form Recovery and Action Recognition Agents

Since this proposed system is designed to sense objects performing multiple, simultaneous actions, the specification for the form-recovery and action-recognition agents must be updated to reflect this fact. To begin, form recovery still assumes a model-based representation, wherein a feature vector comprised of the detected 3-D world-coordinate locations of all model feature points is constructed. As identified in Chapter 2, actions may occur at varying time scales, necessitating the estimation of a time-normalization constant. However, an initial action classification must first be made. This inherently depends on the chosen form of action coding. When multiple simultaneous actions are to be recognized, as identified in Chapter 3, they may be represented either in a combinatory manner, or by using multi-action coding. The former represents all potential combinations of multiple actions as individual entries in the action library. While potentially expensive in terms of storage and search times, it is useful for its simplicity when the object is known to only perform certain combinations of actions. In these cases, the previous action-recognition agent from Chapter 4 can be used directly. This form of coding will be used for the later multi-action experiments in Section (5.3). True multi-action coding is a more general and powerful process, wherein some sub-parts of an object action feature vector may be specifically marked as available for other actions. In this manner, portions of the OoI which do not contribute to positively recognizing an action may instead be used to identify other, simultaneous actions. The most general form of this method directly implements the additive scheme for multiple actions described in Chapter 3. In this case, all portions of the OoI are used to distinguish actions, but the additive combination of multiple actions is simultaneously recovered. This method is the most general, but also the most difficult to implement, so the system designer must select the method that is best suited to their OoI and its actions.

Pose Prediction and Referee Agents

The specification of all other agents in the system remains unchanged from Chapter 4. The pose-prediction agent still uses a basic KF as a reference implementation for object pose prediction. Similarly, the referee agent still implements the basic rule-sets identified in Chapter 4, and is responsible for maintaining basic fallback poses for the system. These agents will be expanded and modified as part of the iterative revision process in Chapter 6, wherein a novel pipeline structure will be presented. For basic multi-action and multi-level action recognition, however, no significant changes were found to be necessary.
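As a reference point for the pose-prediction agent described above, the following sketch implements a basic constant-velocity Kalman filter in the plane. The specific state vector, noise magnitudes, and update rate are illustrative assumptions; they are not the tuned values used in the experiments.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter for OoI/obstacle pose prediction.

    State is [x, y, vx, vy]; only position is measured.  Noise magnitudes are
    placeholders that would be tuned for the actual subject and stages.
    """
    def __init__(self, q=1.0, r=5.0):
        self.x = np.zeros(4)                 # state estimate
        self.P = np.eye(4) * 1e3             # state covariance (uninformative)
        self.q, self.r = q, r

    def predict(self, dt):
        F = np.eye(4)
        F[0, 2] = F[1, 3] = dt               # x += vx*dt, y += vy*dt
        Q = self.q * np.eye(4) * dt
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + Q
        return self.x[:2]                    # predicted position

    def update(self, z):
        H = np.zeros((2, 4)); H[0, 0] = H[1, 1] = 1.0
        R = self.r * np.eye(2)
        y = np.asarray(z, dtype=float) - H @ self.x
        S = H @ self.P @ H.T + R
        K = self.P @ H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ H) @ self.P

# Track a subject moving at roughly constant velocity, then predict the pose
# half a second ahead for the next demand instant.
kf = ConstantVelocityKF()
for z in [(0.0, 0.0), (10.0, 1.0), (20.5, 2.1), (29.8, 2.9)]:
    kf.predict(dt=0.1)
    kf.update(z)
print(kf.predict(dt=0.5))
```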

Human Analogue

To perform controlled real-world experiments, it is necessary to control the most significant variable in the problem at hand: the input motion. Humans do not perform actions uniformly; each action is inherently unique, and contains at least slight variations in its component motions and its length. A basic scientific requirement is that experiments are rigorous and repeatable. To ensure that all experiments that follow are repeatable, a human analogue was designed which directly implements the 14-dof skeletal model used by the system. In this manner, the input data to the system is identical to the data that would be sensed if a real human were used. An overview of the analogue is shown below in Figure 5.2.

FIGURE 5.2 OVERVIEW OF HUMAN ANALOGUE FOR INITIAL REAL-WORLD EXPERIMENTS

From Figure 5.2, one can note that all joint locations, and the range of motion, are designed to mirror the chosen skeletal model, [95], used in this proposed methodology and the previous single-action methodology. All library actions are captured from real humans, using a process which maps their actions directly to the 14-dof skeletal model. In this manner, there is no difference in the input data that the system receives. However, the same trial run can be repeated multiple times in this manner, allowing for repeatable experiments and detailed examination of the effects at work. Furthermore, the variation in human actions is captured by recording multiple runs of each action from the same human subject. These runs can be averaged when producing the action library, to allow it to represent the average action. For experimental trials, multiple runs can be used to ensure that the system is still robust to typical variations in a real action, while still maintaining repeatability.
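A minimal sketch of how multiple recorded runs of one action could be averaged into a single library entry is given below. It assumes that each run is stored as a sequence of 14-dof joint-angle vectors and applies a simple linear time normalization before averaging; the actual capture and registration process used to build the library may differ.

```python
import numpy as np

def average_action(runs, n_frames=50):
    """Average several recorded runs of one action into a single library entry.

    runs : list of arrays, each (T_i, D): per-frame joint-angle vectors of one
           recorded performance (D = 14 for the 14-dof skeletal model).
    Each run is first resampled to a common length so that corresponding
    phases of the action are averaged together (simple linear time warping).
    """
    resampled = []
    for run in runs:
        run = np.asarray(run, dtype=float)
        t_src = np.linspace(0.0, 1.0, len(run))
        t_dst = np.linspace(0.0, 1.0, n_frames)
        # Interpolate each joint channel independently onto the common timeline.
        resampled.append(np.column_stack(
            [np.interp(t_dst, t_src, run[:, d]) for d in range(run.shape[1])]))
    return np.mean(resampled, axis=0)        # (n_frames, D) average action

# Three noisy recordings of the same 14-dof action, with different lengths.
rng = np.random.default_rng(0)
runs = [np.sin(np.linspace(0, 2 * np.pi, T))[:, None]
        + 0.05 * rng.standard_normal((T, 14)) for T in (48, 55, 61)]
library_entry = average_action(runs)
print(library_entry.shape)
```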

A detailed discussion of the validity of using a human analogue in these experiments is presented in Chapter 6. Therein, an iterative improvement to this analogue is presented, along with experiments which demonstrate that the system evaluation is identical when sensing either the analogue or a real human. Although a fully autonomous analogue is developed in Chapter 6, the manually-posed analogue shown above was sufficient for the experiments in this Chapter.

5.2 System Calibration

As identified in Chapter 4, real-world action sensing requires interaction with an environment through a physical sensing system, which inherently introduces systematic error. Various calibration methods, including both camera calibration and motion calibration, are used in real-world applications to reduce the magnitude of this systematic error. Traditional sensing systems are typically designed to be robust to camera and motion error, up to a maximum tolerance level. If the error exceeds the system's tolerance, it will not function, so the designer must calibrate the system to reduce error levels to below this threshold. There are steep diminishing returns for additional calibration effort once this threshold is surpassed; thus, past research has often ignored the issue of calibration for active-vision systems. However, in an active-vision system, two main sources of systematic error are closely coupled. Using a naive approach which ignores the interaction between these two sources will greatly increase the level of effort required to calibrate the system. Over the course of a pose decision, the system will operate in multiple coordinate systems, as shown in Figure 5.3.

FIGURE 5.3 COORDINATE SYSTEMS AND ADDITIVE ERROR

In Figure 5.3, the three main coordinate systems used are pixel coordinates, camera coordinates, and world coordinates.

Converting from camera coordinates to world coordinates, and vice versa, uses the extrinsic camera-calibration matrix (Chapter 3), which captures the rotation and translation of the camera frame relative to the world-coordinate center. For fixed-camera systems, this matrix is captured as part of off-line camera calibration. However, for active vision, the matrix will change as the camera moves in the environment, and the estimation error for the matrix is directly coupled to motion-stage error. For example, let us consider Figure 5.4 below.

FIGURE 5.4 EXAMPLE CAMERA MOTION STAGE MIS-ALIGNMENT

In Figure 5.4, the camera motion stage is misaligned: the true pose achieved by the camera is not equal to the given, desired pose. Instead, there is an additive error component which is pose-dependent. This additive error component is internally correlated among multiple axes. Consider the camera motion shown in Figure 5.4: the desired motion path is purely along the x-axis, but this motion produces correlated error in the y-axis position. Poses near the leftmost limit in the figure have significant y-error, while poses near the rightmost limit do not. If the system naively applies extrinsic calibration without separating the additive error component in world coordinates, the extrinsic matrix itself becomes a function of world pose. In this case, the calibrated extrinsic matrix will be a good fit at some poses and a poor fit at others; the basic extrinsic model would not be sufficient to capture the underlying process. This is the core issue with applying naive calibration in an active-vision system; by not separating the sources of error first, some parts of the system model may attempt to capture processes and error which they are not suited to represent. As such, this work proposes a combined camera and motion calibration scheme which automatically balances the calibration process, such that each source of error is calibrated by the best possible model to represent it. Although camera calibration is a well-developed field, this method represents a unique framework which proved to be better suited to the proposed sensing framework than any of the other general-purpose calibration methods which were examined.
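The following small numerical illustration (with made-up values, not thesis data) shows why a single, constant extrinsic correction cannot absorb a pose-dependent stage error: the residual left after fitting a constant offset remains strongly correlated with the commanded pose.

```python
import numpy as np

# Illustrative stage misalignment: commanded motion is along x only, but the
# true pose picks up a y-error that grows linearly with x (a slight skew).
x_cmd = np.linspace(0.0, 500.0, 26)          # commanded stage x positions
y_err_true = 0.004 * x_cmd                   # pose-dependent y error (skew)

# Naive extrinsic calibration can only absorb a constant y offset.
const_offset = y_err_true.mean()
residual = y_err_true - const_offset

print("constant offset fitted:", round(const_offset, 3))
print("residual at x=0  :", round(residual[0], 3))
print("residual at x=500:", round(residual[-1], 3))
# The residual is strongly correlated with the commanded pose, so a single
# extrinsic matrix fits well near mid-travel and poorly at the limits; a
# pose-dependent mapping, as proposed in the following sections, is needed.
print("corr(residual, x):", round(np.corrcoef(residual, x_cmd)[0, 1], 3))
```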

Base Calibration Methodology

As identified above, the complete system model for a given active-vision system will typically contain multiple partially redundant dof. To prevent an ambiguous calibration outcome, a mapping between sources of error in the environment/system and the individual calibration steps is needed. However, to generate this mapping, knowledge of the error which is successfully calibrated by each step is also needed. This circular requirement necessitates an iterative method in most real-world cases. To begin, it is useful to examine past work which comprises distinct camera and motion calibration methods. Two general types of intrinsic calibration methods have often been used in computer vision: standard off-line calibration and self-calibration [160]. The former methods are the simplest to implement, and use acquired images of an object with a priori known structure to minimize the error between the image and the object's re-projection (e.g., [141]). Self-calibration methods, on the other hand, aim to recover intrinsic parameters by automatically finding known reference points in an image at expected locations (e.g., [161]). These are still considered off-line in that the calibration occurs outside the normal operation of the system. Recently, true on-line calibration methodologies have been developed (e.g., [160], where on-line calibration of a camera is included as part of fitting a deformable mesh model to a human face). Mobile-camera calibration, however, still remains a complex problem. Motion calibration methods (extrinsic calibration) are typically classified in a similar manner. Some available off-line methods (e.g., [90], [162], and [163]) fit an a priori known motion model to a given system. Other methods attempt to discover the appropriate motion model for a system through hypothesis testing (e.g., [164]). True on-line calibration of sensor motion has also been achieved using feedback from the sensor-system payload (e.g., [165], [166]). Again, however, motion calibration is typically treated separately. As such, the proposed novel calibration method will use an iterative scheme which sequentially fits each sub-part of the system model. Given these initial fits, an estimate of the modeled and un-modeled error in each step will be produced and used to modify subsequent iterations. This method will be an off-line self-calibration method, though it can be easily modified for on-line calibration, if desired. The overall goal of the method is to minimize the re-projected pixel error with respect to the given true pixel coordinates:

min Σ_m ||x_p,m − x̂_p,m||    (5.4)

In Equation (5.4), the goal is to minimize the error, taken as the straight-line pixel-coordinate distance between a detected point, x_p,m, and the true projection of that point from 3-D world coordinates onto the camera plane, x̂_p,m. In this manner, the calibration method attempts to minimize the error in detected feature locations, which are the most basic input to the reconfiguration process.

System Model

Since the goal is to minimize error in the projected pixel coordinates of a given world-coordinate feature point, it is useful to begin with the intrinsic camera model, which relates camera coordinates to pixel coordinates. The intrinsic camera model chosen is based on the established Plumb Bob distortion model given in [140]. The intrinsic calibration is comprised of two sub-models, one for camera-lens distortion, and one for the camera projection itself. For most camera distortion models, including this one, real-world lens distortions are typically assumed to consist primarily of radial distortion, and slight tangential distortion. As such, one defines the distorted camera coordinates, x_d, given the normalized camera coordinates, x_n:

x_n = [x; y] = [X_C/Z_C; Y_C/Z_C]    (5.5)
r² = x² + y²    (5.6)
x_d = [x_d; y_d] = (1 + k_c1·r² + k_c2·r⁴ + k_c5·r⁶)·x_n + dx    (5.7)
dx = [2·k_c3·x·y + k_c4·(r² + 2·x²); k_c3·(r² + 2·y²) + 2·k_c4·x·y]    (5.8)

In Equations (5.5) to (5.8), X_C, Y_C, and Z_C are the coordinates of a point in the camera frame, and k_c1 to k_c5 are the distortion-model coefficients. This optical model is inherently general; it is often useful for the system designer to select a simplified model to reduce fitting effort and avoid over-fitting. For example, standard cameras (not wide-angle) often do not benefit from the 6th-order radial and tangential components; it is often beneficial to force these to zero during calibration, as over-fitting otherwise tends to occur. A final projection of these distorted points provides the projected pixel coordinates:

x_p = fc_1·(x_d + α·y_d) + cc_x,   y_p = fc_2·y_d + cc_y    (5.9)

In Equation (5.9), fc_1 and fc_2 are the focal distances in horizontal and vertical pixels, respectively, cc_x and cc_y are the principal-point coordinates, and α is the skew coefficient accounting for the angle between the x and y sensor axes. Again, simplifying assumptions can be used to reduce model complexity. For example, many camera systems can assume rectangular pixels, and thus zero skew (i.e., α = 0).
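For reference, the sketch below applies the standard Plumb Bob distortion and pinhole projection in the form reconstructed above. The numeric intrinsic values are placeholders for a generic webcam, not the calibrated parameters of the experimental cameras.

```python
import numpy as np

def project_point(X_cam, fc, cc, kc, alpha=0.0):
    """Project a camera-frame point to pixel coordinates with the Plumb Bob
    distortion model (cf. Equations (5.5) to (5.9)).

    X_cam : (3,) point [X_C, Y_C, Z_C] in camera coordinates
    fc    : (fc1, fc2) focal lengths in horizontal/vertical pixels
    cc    : (cc_x, cc_y) principal point
    kc    : (kc1..kc5), radial (kc1, kc2, kc5) and tangential (kc3, kc4) terms
    alpha : skew coefficient between the x and y sensor axes (often zero)
    """
    Xc, Yc, Zc = X_cam
    x, y = Xc / Zc, Yc / Zc                        # normalized coordinates
    r2 = x * x + y * y                             # squared radius
    radial = 1.0 + kc[0] * r2 + kc[1] * r2**2 + kc[4] * r2**3
    dx = 2.0 * kc[2] * x * y + kc[3] * (r2 + 2.0 * x * x)   # tangential terms
    dy = kc[2] * (r2 + 2.0 * y * y) + 2.0 * kc[3] * x * y
    xd, yd = radial * x + dx, radial * y + dy      # distorted coordinates
    u = fc[0] * (xd + alpha * yd) + cc[0]          # pixel projection
    v = fc[1] * yd + cc[1]
    return np.array([u, v])

# Placeholder intrinsics roughly matching a 640x480 webcam.
print(project_point([0.1, -0.05, 1.0],
                    fc=(700.0, 700.0), cc=(320.0, 240.0),
                    kc=(-0.25, 0.10, 0.001, 0.001, 0.0)))
```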

This intrinsic model will not be developed further, as it was chosen primarily for its flexibility and is not central to this work; other models can be used. The focus of this section is on the recovery of the mapping from world coordinates into camera coordinates, X_C, Y_C, and Z_C. The conversion from camera coordinates to world coordinates, and vice-versa, is given by the extrinsic model of the system. Again, this is comprised of two sub-models: the extrinsic camera-calibration matrix, and the stage motion model. The general form of the extrinsic camera-calibration transform is given as follows:

[X_C; Y_C; Z_C; 1] = [r_11 r_12 r_13 t_1; r_21 r_22 r_23 t_2; r_31 r_32 r_33 t_3; 0 0 0 1] · [X_W; Y_W; Z_W; 1]    (5.10)

In Equation (5.10), the transformation from world coordinates, X_W, Y_W, and Z_W, to camera coordinates is given by a general-form rotation (coefficients r_ij) and translation (coefficients t_i) matrix using homogeneous coordinates. The actual form of the rotation must be chosen based on the physical system; the following section assumes a typical three-dof rotation model, which can be applied to most situations. To complete the extrinsic calibration, a fourth coordinate system is defined, stage coordinates, which are distorted world coordinates based on the output of the motion controller. A flexible mapping is used to relate world coordinates to stage coordinates. The general form for a single dimension is as follows:

x_w = Σ_{i=1..I} kx_{1,i}·(x_stage)^i + Σ_{j=1..J} kx_{2,j}·(y_stage)^j + Σ_{k=1..K} kx_{3,k}·(z_stage)^k + x_con    (5.11)

From Equation (5.11), the adjusted world x-coordinate of the camera, x_w, is given by a mapping combining all stage-coordinate camera position coordinates, x_stage, y_stage, and z_stage, and a constant, x_con. The flexible mapping is controlled by the coefficients kx_{1,1..I}, kx_{2,1..J}, and kx_{3,1..K}, allowing the mapping order to be completely user-specified. This model is inherently general, and it is expected that the designer will appropriately zero coefficients where no significant correlation is detected, to reduce the model's complexity. Similar mappings are applied to each dof in the 6-dof extrinsic model. Given these models, it is now possible to define the calibration process itself.
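As an aside, fitting the flexible per-axis mapping of Equation (5.11) in isolation reduces to a linear least-squares problem, as sketched below. The function name and the synthetic data are assumptions made for illustration; the complete method instead recovers these coefficients jointly with the remaining model parameters, as described next.

```python
import numpy as np

def fit_stage_to_world_axis(stage_xyz, world_x, orders=(2, 1, 1)):
    """Fit the flexible per-axis mapping of Equation (5.11) by least squares.

    stage_xyz : (M, 3) stage coordinates (x_stage, y_stage, z_stage)
    world_x   : (M,) corresponding true world x-coordinates of the camera
    orders    : polynomial order (I, J, K) for each stage axis; an axis with
                order 0 contributes no terms (its coefficients are zeroed).
    Returns the stacked coefficients [kx_1,1..I, kx_2,1..J, kx_3,1..K, x_con].
    """
    stage_xyz = np.asarray(stage_xyz, dtype=float)
    cols = []
    for axis, order in enumerate(orders):
        for p in range(1, order + 1):
            cols.append(stage_xyz[:, axis] ** p)   # higher-order stage terms
    cols.append(np.ones(len(stage_xyz)))           # constant offset x_con
    A = np.column_stack(cols)
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(world_x, dtype=float), rcond=None)
    return coeffs

# Synthetic stage with a slight quadratic non-linearity and a constant offset.
rng = np.random.default_rng(1)
stage = rng.uniform(0.0, 500.0, size=(200, 3))
world_x = (1.002 * stage[:, 0] + 2e-5 * stage[:, 0] ** 2 + 3.5
           + 0.05 * rng.standard_normal(200))
print(fit_stage_to_world_axis(stage, world_x, orders=(2, 0, 0)))
```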

Calibration Process

The proposed multi-stage iterative algorithm used to recover both the intrinsic coefficients and the extrinsic mapping is given in the detailed pseudo-code listing that follows, Table 5.2. Complete nomenclature for this pseudo-code listing is found in the subsequent table, Table 5.3. The key novelty of this procedure is the flexible stage-to-world mapping described above. The system calibration recovers the inter-relationships between sensor movement and actual pose, accounting for the majority of possible misalignments. It also accounts for stage non-linearity by recovering coefficients for higher-order versions of each variable (e.g., squared and cubed stage coordinates).

TABLE 5.2 PSEUDO-CODE LISTING FOR PROPOSED NOVEL SYSTEM CALIBRATION METHOD

Initialization:
    Given a set of m observations (detected pixel coordinates), the corresponding camera poses, and the known OoI positions, begin with an initial guess for the intrinsic parameters and the extrinsic mapping, and calculate the normalized camera coordinates of the observed points.

Loop 1 (outlier rejection):
    Loop 2(a) (intrinsic estimation):
        Loop 3:
            Calculate the distorted, normalized camera coordinates from the current intrinsic estimate, as in the intrinsic model above.
            Re-estimate the intrinsic parameters, and re-calculate the normalized camera coordinates for each point.
            Continue while the change in the intrinsic parameters exceeds the minimum threshold and the iteration limit has not been reached.
        Calculate the re-projected pixel error for each point using the new intrinsic estimate.
        Continue while the error decreases and the iteration limit has not been reached.
    Remove all data points whose re-projected pixel error exceeds the outlier threshold.
    Continue while at least one point is removed.

Extrinsic estimation:
    Start with an initial estimate of the rotation-function coefficients, ra_1..ra_R, rb_1..rb_R, rc_1..rc_R, and the constant offsets, ra_con, rb_con, rc_con.
    Estimate the translation-function coefficients, kx_1,1..kx_3,I, ky_1,1..ky_3,I, kz_1,1..kz_3,I, the constant offsets, x_con, y_con, z_con, and the axis offsets, T_1, T_2, T_3, using the chosen rotation model (Tsai's model herein) and the flexible translation mapping of Equation (5.11).
    Loop 2(b):
        Using these estimates, re-estimate ra_1..ra_R, rb_1..rb_R, rc_1..rc_R, ra_con, rb_con, rc_con.
        Calculate the re-projected pixel error for each point.
        Continue while the error decreases and the iteration limit has not been reached.

Continue the overall process while the re-projected pixel error decreases and the overall iteration limit has not been reached.

TABLE 5.3 NOMENCLATURE FOR CALIBRATION METHOD PSEUDO-CODE LISTING

- Observed pixel coordinates for a given OoI point
- Total number of observed points
- Vector of observed pixel coordinates for the m-th point in the dataset
- Known world coordinates of the camera movement stage for the m-th dataset point
- Known rotations of the camera movement stage for the m-th point
- Pose vector for the camera at the m-th point
- Known world coordinates of the m-th OoI point
- Vector of known world coordinates of the m-th OoI point
- Camera coordinates of the m-th OoI point
- Normalized camera coordinates of the m-th OoI point
- Vector of normalized camera coordinates for the m-th OoI point
- Distorted, normalized camera coordinates for the m-th point
- Vector of distorted, normalized camera coordinates for the m-th point
- Delta change in the indicated value from the last innermost loop iteration
- Minimum change in the intrinsic parameters
- Iteration counter of the current loop
- Maximum parameter-estimation loop iterations for the intrinsic, extrinsic, and overall loops, respectively
- Vector of estimated pixel coordinates from projecting the point through the estimated intrinsic/extrinsic models
- Outlier threshold for pixel coordinates
- R: extrinsic-model rotation function order
- I, J, K: extrinsic-model translation function orders (x, y, and z order, respectively)
- ra_1..ra_R, rb_1..rb_R, rc_1..rc_R: rotation function coefficients for φ, ψ, and θ, respectively
- ra_con, rb_con, rc_con: constant rotation offsets for φ, ψ, and θ, respectively
- kx, ky, kz: translation function coefficients for x, y, and z (world), respectively
- x_con, y_con, z_con: constant translation offsets for x, y, and z (stage), respectively
- T_1, T_2, T_3: constant axis offsets in camera coordinates

For this process, the outlier threshold may be initialized to a high initial value, and decreased after every iteration to account for the decreasing average pixel error. Global optimality of the fitting process is not guaranteed, since a trade-off for speed and simplicity has been made. The average-case performance is significantly superior to a basic algorithm that does not account for motion-stage imperfections and systemic pose errors. As mentioned above, a suitable selection of the model order is necessary to prevent over-fitting of the dataset. The rotation model used above is Tsai's model [167], though any other rotation model could be used; the key is using a higher-order model to map known stage translations and rotations. The rotation functions also utilize the same general, flexible mapping used for the translations. However, these coefficients would typically be considerably more difficult to calibrate for. Thus, the pseudo-code listing above assumes a simplified rotational mapping, with no correlation between dof. This can be expanded to the full model from the previous section, Section (5.2.2), if the chosen rotational model does not introduce redundant dof.
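The following Python skeleton summarizes the alternation and outlier-rejection logic of Table 5.2 at a high level. The three estimator callables are placeholders standing in for the actual intrinsic, extrinsic, and re-projection routines, so the sketch captures only the control flow, not the fitting itself.

```python
import numpy as np

def calibrate(observations, estimate_intrinsics, estimate_extrinsics,
              reproject, outlier_thresh=10.0, max_outer=5, max_inner=20, tol=1e-4):
    """Control-flow skeleton of the calibration loop summarized in Table 5.2.

    observations        : list of (pixel_xy, camera_pose, world_xyz) tuples
    estimate_intrinsics : callable(data, intr) -> refined intrinsic parameters
    estimate_extrinsics : callable(data, intr, extr) -> refined extrinsic mapping
    reproject           : callable(sample, intr, extr) -> predicted pixel_xy
    """
    data = list(observations)
    intr, extr = None, None
    for _ in range(max_outer):
        prev_err = np.inf
        for _ in range(max_inner):                 # alternate the two fits
            intr = estimate_intrinsics(data, intr)
            extr = estimate_extrinsics(data, intr, extr)
            err = np.mean([np.linalg.norm(reproject(s, intr, extr) - s[0])
                           for s in data])
            if prev_err - err < tol:               # converged at this threshold
                break
            prev_err = err
        # Reject outliers, then tighten the threshold for the next pass.
        kept = [s for s in data
                if np.linalg.norm(reproject(s, intr, extr) - s[0]) <= outlier_thresh]
        if len(kept) == len(data):
            break
        data, outlier_thresh = kept, outlier_thresh * 0.8
    return intr, extr

# Dummy usage with trivial placeholder estimators (real fits would go here).
obs = [(np.array([100.0, 120.0]), None, None)]
intr, extr = calibrate(obs,
                       estimate_intrinsics=lambda d, i: "intrinsics",
                       estimate_extrinsics=lambda d, i, e: "extrinsics",
                       reproject=lambda s, i, e: s[0] + 1.0)
print(intr, extr)
```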

The calibration method requires a dataset of known world-coordinate points, and the corresponding pixel coordinates and camera poses. In order to acquire these data, in the following experiments, a high-precision X-Y stage was used to move a known OoI, marked with P uniquely identifiable markers. The X-Y stage moves the OoI across a discretized grid of N × M points. Finally, the movement capabilities of each camera are discretized into R points rotationally (for a 1-D rotation), and D points for a 1-D translation. For a given 2-dof camera, this results in a raw dataset of N × M × R × D × P points. It is assumed that the camera and OoI poses for these points are available with minimal error. This method attempts to give uniform preference to all parts of the camera motion relative to the OoI, although some areas in the problem space will inevitably be over-represented. If a priori knowledge of the application is available, some areas may be deliberately over-represented for better calibration. For any given camera dataset, it is necessary to select a distortion and translation model order that best represents the data and, hence, the sensor. Under the proposed methodology, some translation-model coefficients may also be forced to zero to reduce complexity if no significant covariance is evident (i.e., kx_{2,1..J} and kx_{3,1..K} can be set to zero if x_w does not depend significantly on y_stage or z_stage). These choices depend entirely on the quality of the collected dataset, and the physical quality of the camera arrangement. Care must be taken to avoid over-fitting.

Sample Calibration

To demonstrate that the proposed calibration methodology is feasible, a sample calibration is presented. The goal of this example is to demonstrate that the proposed method offers an improvement in calibration accuracy over the basic algorithm used in past work [83], and over any given algorithm which does not perform system calibration when dynamic sensors are used. This section will also demonstrate the selection of a suitable distortion-model order and translation-model order for a sample camera. For this experiment, Camera 1 in the physical setup is to be calibrated. The distortion-model order varies from 0-order (no pixel distortion) to 5th-order. Similarly, the translation-model order (I = J = K for these experiments) will vary from 0-order (effectively no system calibration) to 2nd-order.

Note that in the 0-order translation case, the system will still calibrate for constant offsets in x_stage, y_stage, etc., so 4th and 5th cases are included for this trial, representing no translational calibration and no system calibration at all, respectively. In this trial, it is assumed that each of x_w, y_w, and z_w is dependent only on x_stage, y_stage, and z_stage, respectively, by forcing to zero the appropriate coefficients in the translation model (i.e., kx_{2,1..J} and kx_{3,1..K} for x_w). The rotation-model order will be fixed at 0-order, as the rotational stages were already found to be highly linear. The results for the experiment are presented in Figure 5.5; a total of 908 raw data points were captured for input into the calibration algorithm. The first noticeable effect is that error is significantly reduced (a 20% reduction in best-case error) by using only 0-order translational calibration (constant offsets). Even on a high-precision table, positioning errors of a few millimeters may exist from non-controlled physical components. One can also see a clear trend in the distortion-model order data. For any translation-model order chosen, this dataset exhibits the lowest calibrated pixel error with a 4th-order distortion model. For example, with 2nd-order system-translational calibration, a 4th-order distortion model results in 6.22% less error than no distortion model. A higher order results in slight over-fitting. At this point, one can establish that the methodology is clearly reducing the overall pixel error, compared to past calibration [90] based only on Tsai's method.

FIGURE 5.5 SYSTEM CALIBRATION RESULTS FOR SAMPLE CALIBRATION

However, it is important to examine the effect of the translation-model order in more detail. One can see a further reduction in average error by using 2nd-order system-translational calibration versus 0-order. However, one can also note that this reduction in error is significantly smaller. For this sensor, the associated translational stage is highly linear, and as such a higher-order model is not useful. Most of the error is in the positioning of the stage itself, a human-measured and (initially) calibrated process. As such, for this camera, a 0-order translation model and a 4th-order distortion model were used for all later experiments.

5.3 Real-world Experiments

Given the details of the improved multi-level, multi-action sensing methodology, this section will present real-world experiments which will (i) verify the real-world operation and basic theory of the method, (ii) confirm the results of past simulated experiments, and (iii) characterize the performance of the method when sensing multiple actions and multi-level actions. A total of three experiments will be presented in this section. The first is a multi-factorial experiment which duplicates the simulated experiments from Chapter 4 in real-world conditions. This will achieve Goals (i) and (ii) above. Goal (iii) will be addressed using two joint experiments. In the first experiment, a library of human walking motions captured from multiple subjects will be implemented. The system will attempt to perform human gait-based recognition, which will characterize performance in recognizing multiple, sequential actions. For the second experiment, the human analogue will perform pairs of randomly selected actions from an action library simultaneously. This will examine the effect of multiple, simultaneous actions. Performance will be further characterized by repeatedly performing one specific pair of these actions and estimating a secondary parameter. The action pair is specifically selected to contain actions which occur at different levels of detail, allowing the experiment to characterize multi-level performance.

Experimental Setup

The real-world experimental environment is designed to accurately reflect a real human action-sensing situation. The environment itself, including the layout of obstacles and the work environment, is based on a typical medium or large room, which may contain multiple static obstacles (e.g., support columns, furniture) and multiple dynamic obstacles (e.g., machinery, other people). A common sensing task is to recognize basic human actions, such as walking or non-verbal communication (pointing), in an automated manner. The past simulated environment (Chapter 4) was designed to closely resemble this scaled real-world environment. As in the simulated experiments, all object models are based on corresponding real-world objects.

Also, as described in Section (5.1), the human analogue directly implements the skeletal model which would be used when tracking real humans. Using an analogue allows the experimental input, human actions, to be controlled and reproduced in a scientific manner. The layout of sensors, including their capabilities and initial poses, is determined through an off-line reconfiguration algorithm. An overview of the sensing environment is shown below, in Figure 5.6.

FIGURE 5.6 OVERVIEW OF REAL-WORLD ACTIVE-VISION ENVIRONMENT

As in the simulated experiments, the sensing system is comprised of four Sensor Agents, with associated physical sensors. The four physical cameras are attached to one or more motion stages, which are all calibrated using the complete system-calibration method presented in Section (5.2). The other pertinent details of the environment are as follows:

- The workspace dimensions match those of the simulated environment. Artificial, opaque cylindrical objects represent static obstacles in the system, such as support columns in a large room.
- The human analogue is mounted on an X-Y stage for positioning, which defines its area of motion. The analogue pose is created manually using angle gauges on the analogue itself, and a library of action key-frames.

- The action library itself was captured from real human subjects. All experiments in this Chapter will be performed in a quasi-static manner, due to the manually-operated subject.
- The cameras are Logitech QuickCam Pro 9000 web-cameras, with a maximum resolution of 1.3 MP; they operate at a reduced resolution for consistency with the simulated experiments. The camera hardware features automatic focus correction, brightness and contrast controls, and other automatic image-correction options, all of which are disabled to prevent interference with the feature-tracking algorithms. As mentioned previously, the moving cameras are calibrated using a 4th-order lens-distortion model and 0-order translational and rotational motion models.
- All four cameras have a rotational dof with 360° of travel, but are restricted to 180°. Two of these cameras have an additional translational dof with a travel range of 500.
- Translational and rotational motion is provided by high-accuracy Parker Daedal linear and rotary stages. These are controlled by a motion-server architecture implemented on a Parker Motion 6K8 controller, which receives commands from a client PC via Ethernet. Motion commands are executed immediately upon receipt, and are assumed to have inconsequential delay once issued.
- The setup was created at approximately 1:6 scale to a real-world room. The locations of all static obstacles are selected to be highly occluding to one associated camera, and are representative of static support columns in the model room. Motion paths for dynamic obstacles are selected quasi-randomly; paths which significantly occlude one or more cameras are favored, as paths with little occlusion are of little interest to the sensing problem at hand. A cluttered background is also present, as well as smaller, un-modeled obstacles.

Experiment 1

The first experiment is designed to verify that the basic theory behind the sensing system is sound. In particular, these trials will (i) show that there is still a tangible increase in real-world, multi-action sensing performance when using an active-vision system instead of static cameras, and (ii) reproduce the conclusions about system capability and obstacles from Chapter 4 in a real-world setting. To do so, a total of nine experimental trials will be performed to examine all combinations of the two experimental factors. Firstly, the reconfiguration ability of the system will be varied in all trials:

- No Reconfiguration: This is the baseline case for comparison, using static cameras. Cameras will be placed at their initial positions, which are determined through off-line reconfiguration.

- Velocity-limited Slow Reconfiguration: The ability of the system to reconfigure itself is constrained to the same capabilities used in the velocity-limited simulated trials. Cameras are limited to a rotational velocity of 0.35, a rotational acceleration of 0.70, a translational velocity of 45, and a translational acceleration of 900. In this case, however, these values do not reflect the maximum real-world limits on the ability of the system.
- Velocity-limited Fast Reconfiguration: As in the simulated experiments, increased reconfiguration capability will be tested, although the cameras may not instantaneously reposition themselves to any feasible pose. For these trials, the system will operate at its maximum capabilities. Cameras are limited to a rotational velocity of 0.7, a rotational acceleration of 1.40, a translational velocity of 450, and a translational acceleration of 9000.

For this experiment, real object-pose prediction is used, as described in Section (5.1). Feedback about the current OoI form and action is also used in the optimization. The OoI, a human analogue model, moves through the center of the workspace on a linear path at a constant velocity of 100. As in the previous simulated experiments, the analogue continuously performs a single walking action. The locations of the obstacles and the OoI path are shown in Figure 5.7.

FIGURE 5.7 OBSTACLE LOCATIONS AND OOI PATH FOR SIMULATED EXPERIMENT 1

The experiment will also compare three levels of increasingly intrusive obstacles on the system:

- No Obstacles: As for the static-camera case, these trials are present to provide a baseline for comparison. Specifically, in these trials all modeled obstacles are removed from the environment. The smaller un-modeled obstacles and background clutter are still present.
- Static Obstacles: The second trial adds four 50 mm diameter static obstacles to the environment at a priori known positions. These positions have been selected to present significant occlusion to the system, and are similar in layout to those used in the simulated experiments.

- Dynamic Obstacles: In these trials, two dynamic obstacles replace the static obstacles in the sensing environment. The motion paths are selected such that the OoI is closely flanked on both sides by the two obstacles. This choice of motion path presents a significant challenge to the system, in the form of continuous partial or complete occlusion of the OoI.

As with the previous simulated experiments, sensing-task performance is evaluated by the total, absolute error in the recovered subject form. The error metric is calculated as in the experiments in Chapter 4. However, the following experiments establish a different upper limit value for the error metric. For instants with a calculated error-metric value greater than 0.125, the error in the estimated form is considered too great for the system to positively recognize the form. This value was again determined through statistical analysis of multiple trial runs to result in at least a 95% true-positive form-recognition rate for un-rejected frames. The absolute value of this limit naturally differs from the simulated value, due to the use of real-world feature-point detection, tracking, and model fitting. The process and result evaluation remain the same; however, all demand instants in a trial must exhibit error below the threshold for the action to be considered completely recognized. To address real-world variance in results, multiple identical runs for each experimental trial are averaged and plotted. For this experiment, five identical runs are used for each trial. It must be noted that no outlier test runs were observed in any of the trials, indicating that the results are repeatable. All experiments will have corresponding simulated results presented. To facilitate comparison with these simulated results, demand-instant spacing is purposely aligned between both sets of trials. However, the system operates at a slower update rate than the previous simulated results, resulting in a total of 10 demand instants for all simulated trials. Furthermore, due to the physical limits of the X-Y stage used to reproduce OoI motion, only five of the ten simulated instants can be reproduced in the real-world experiments.

Experiment 2

The second experiment is designed to characterize the proposed methodology's performance in sensing multiple, sequential actions. The physical sensing system and base setup are identical to those used for Experiment 1. This includes initial camera poses, system capabilities and velocity limits, and obstacle positioning. The experiment itself is based on an application of TVG object-action recognition: human gait-based identification. The task at hand is the identification of a single subject walking through the sensing environment on an a priori unknown path, using only gait as a distinguishing feature. A total of five subjects are used for this experiment, and their gait is captured and averaged into a five-action library.

In this application, gait-based identification is simply multi-action recognition; each person is assumed to have a relatively unique gait, so each person's gait is modeled as a separate action in the action library. Uniquely identifying the action also identifies the subject. There is significant literature and past work in gait-based identification available, as identified in Chapter 1. The concern here, however, is the positive identification of unique actions. Human gait identification was merely selected as a demanding application; the difference between two human gaits is relatively small compared to that between two more distinct human actions (e.g., walking and waving). The overall set of trials was designed to contrast runs with and without active vision, as before. Ideal reconfiguration will not be tested. A total of 25 walks were performed along the same path by each subject. Of these, five are used for registration when constructing the reference library, and the remaining 20 provide the experimental data. As such, all subjects being tested are registered, so there will be no true-negative testing. For each trial, the subject's gait was reproduced by the human analogue (to scale), and the system was allowed to run until an internal decision criterion was reached, or the subject left the workspace. Performance was evaluated with increasingly intrusive obstacles for both static and dynamic cameras, corresponding to the (i) no obstacles, (ii) static obstacles, and (iii) dynamic obstacles trials from Experiment 1. In previous trials, individual error-metric values for all demand instants in an experiment were examined to characterize the outcome. For these experiments, a categorical output based on the success or failure of the recognition process was recorded instead. Results are thus classified as either (i) true positive, (ii) false positive, or (iii) false negative. If a trial meets the internal decision criteria for positive recognition, the result is recorded accordingly as a true or false positive. False-negative cases cover the remainder of trials, where an action/subject combination may or may not have been detected, but the system was not sufficiently confident in its estimate to reach a definite conclusion.

Experiment 3

The goal of the final experiment is to verify and characterize the performance of the proposed methodology when sensing objects performing multiple, simultaneous actions. As identified previously, these actions may occur at different levels of detail. Two trials are proposed to evaluate performance for simultaneous actions. As with Experiments 1 and 2, the physical setup, including its capabilities and parameters, and the human analogue remain unchanged unless otherwise specified.

Single Level, Multiple Simultaneous Actions

The first trial of this experiment examines a scenario where the system must recognize two simultaneous actions. These actions will be a combination of a walking motion and one other action from the library (pointing, raising arm, and waving).

Actions are encoded using the multiple-action coding scheme described in Section (5.1), so all pairs of actions that are to be sensed are recorded separately in the action library. The time-normalization factors for all actions are kept constant for simplicity. The walking path and speed are the same as in Experiment 1. Action combinations are selected randomly, and a total of 20 trials are performed. These trials are repeated for all combinations of obstacle intrusiveness and system reconfiguration ability found in Experiment 1.

Multi-Level Simultaneous Actions

The second trial examines a case where two specific actions are always performed: walking and pointing. Both actions must be positively recognized in all trials, but the pointing action takes place at a different level of detail than the walking motion. The pointing action is used to estimate a secondary parameter, the location of a selected point in the workspace that the OoI is pointing at. The level of detail available through the use of reconfiguration can thus be characterized by the quality of the estimated secondary parameter. Relatively small changes in the recovered arm and hand poses will have a significant effect on the pointing result if the target point is far away. Changes of the same magnitude in the walking-motion estimation, conversely, are expected to have little effect on the recognition quality. A similar effect can be extrapolated to many human actions; some actions use nearly the complete body (e.g., walking), while others use a much smaller sub-portion (e.g., hand gestures, facial expressions). This trial will (i) verify that both actions are positively recognized in all trials, and (ii) characterize the quality of the estimated parameter by its mean positional error and variance.

Basic Real-world Experimental Results

The experimental results for Experiment 1 demonstrate that the tangible improvement in sensing-task performance found through the simulated experiments in Chapter 4 can be duplicated in real-world experiments. To begin, let us examine the results for the simplest sensing environment, with no obstacles present.

No Obstacles

As mentioned in the previous section, the goal of these three trials is to establish a performance baseline. A trial with no obstacles represents the basic case of a system sensing a single subject with no interference. In reality, almost all environments will have some obstacles, although off-line reconfiguration may largely eliminate their effect. However, the hypothesis is that even under near-ideal conditions, fixed cameras assigned to cover a large area cannot sense the target's action at all instants. This is often the case if sensor coverage is insufficient, leaving dead zones in the active-sensing environment. While it may seem trivial to increase the number of static cameras to remove these dead zones, there may be reason not to do so.

170 145 number of static cameras to remove all dead zones in the workspace. In any case, the goal of these results is to characterize the proposed method s real-world performance; if the system designer decides to apply the method to a given sensing problem, the claimed results presented herein will be predictable and repeatable. The results of the first three trials are presented in Figure 5.8. Both the real-world results and the corresponding simulated results are presented for convenience. As mentioned above, only Demand Instants 5 to 9 of the previous simulated experiments can be reproduced by the physical setup. These instants represent all data collected from a human walking the length of our capture area, scaled appropriately. As can be noted, there exists a close correlation between the simulated and experimental results.

FIGURE 5.8 SIMULATED AND EXPERIMENTAL RESULTS FOR NO OBSTACLES BASELINE TRIAL

172 147 Firstly, for the baseline static-camera trials, the most un-occluded views and, hence, lowest simulated error metric values occur during Demand Instants 6 to 8. During Demand Instants 1 to 5 and 9 to 11, the OoI is out of view for one or more of the static cameras. The use of either form of system reconfiguration (slow or fast) clearly improves performance, by allowing the cameras to attain preferential poses. As expected, in the absence of any obstacles, the system selects nearly identical camera poses for the slow and fast reconfiguration trials. However, minor differences exist between the simulated and experimental results at Demand Instants 6 and 7. Due to optimality issues along a time horizon, in the experimental trials, the proposed on-line algorithm first selects a near-optimal pose for Demand Instant 5 using the predicted OoI location and motion. Concurrently, the expected visibilities at Demand Instants 6 and 7 are also maximized. However, the true visibility for Demand Instant 6 is lower than the expected optimized visibility due to real-world motion estimation, since the OoI has just entered the workspace. Namely, the views provided are missing data on some subsections of the OoI that were expected to be present. Feedback from the form-recognition process works to correct this problem, as the system now selects new optimal viewpoints under a revised weighting for Demand Instants 7 and onward. By Demand Instant 7, the fast system is able to correct the problem, and selects poses which result in an error metric value almost identical to the static case. The slow system, however, takes until Demand Instant 8. This is the price of reconfiguration: in some cases global performance is improved, but at the expense of short-term performance. This effect was consistent across multiple identical trials. One can also note a similar effect in the simulated trials at Demand Instant 6, but the disparity in expected and true visibility is not as great. Since the simulated system has motion observations for the OoI prior to Demand Instant 5, its motion estimate is no longer poor. Thus, the camera poses selected in Demand Instant 5 do not present as poor a set of alternatives in Instant 6 as they do in the experimental trials. However, the remaining issue is that the slow system still does not possess sufficient camera-motion capability to reach the best possible poses at Demand Instant 6 from those chosen for Demand Instant 5. Regardless, the hypothesis that reconfiguration improves recognition performance is shown to be true in even this most simplified, baseline case.

FIGURE 5.9 EXPERIMENTAL RESULTS FOR STATIC OBSTACLES BASELINE TRIAL

Static Obstacles

The second set of trials aims to quantify the improvement in recognition performance when multiple static obstacles are present in the sensing environment. This situation reflects a real-world cluttered-room scenario, where multiple fixed objects may interfere with one or more cameras in the sensing system. As outlined in the experimental setup, the obstacle locations have been explicitly chosen to present significant occlusions to the system for at least one instant. In this case, the obstacles would significantly (in a static system) occlude two cameras simultaneously at Demand Instant 8. Results are presented above, in Figure 5.9.

The simulation predicts that the no-obstacles and static-obstacles cases will be nearly identical in recovered OoI form error when using sensing-system reconfiguration. In essence, it predicts that the system will reconfigure the cameras such that the recognition algorithm always attains un-occluded views. The experimental data verify this conjecture. The effect of the static obstacles on the static sensing system is clearly visible in Demand Instant 8. At this instant, the subject form is no longer recognized as it was in the previous, no-obstacle trial. However, the use of reconfiguration returns the error metric value to that of the original baseline trial, showing that the effect of the obstacles is negated. To better illustrate this process, a movie-strip view of the fast reconfiguration trial of this experiment is presented in Figure 5.10 below. The positions of the OoI and the system obstacles are shown, as well as the recovered OoI form.

FIGURE 5.10 MOVIE-STRIP VIEW OF SAMPLE DEMAND INSTANT (columns: Demand Instant; Actual View; Recovered Form, Front and Left; Sensor Positions; Error Metric)

One can note that the recovered OoI form closely resembles the form of the human analogue for all instants. After commencing sensing operations at Demand Instant 5, the cameras begin to move toward preferential positions for sensing all sub-parts of the object. In particular, Cameras 1 and 3 approach the far limits of their travel to attempt to balance viewing the front, back, and sides of the OoI. This also avoids any potential occlusion of Camera 3 in advance. At Demand Instant 8, the OoI approaches a static obstacle which would occlude Camera 1, so the system moves this camera to the

opposite side of the obstacle to maintain an un-occluded view. In this manner, un-occluded views are available to the system at all instants in this trial. The selected poses also tend to cover all quadrants of the OoI in an attempt to maintain complete information about all sub-parts of the OoI.

Dynamic Obstacles

The third trial for this experiment deals with a typical real-world scenario. The system must sense the OoI's action in the presence of intrusive dynamic obstacles occluding the OoI's motion. Two dynamic obstacles closely flanking the subject represent a challenging scenario, as the OoI tends to be partly or completely occluded by these obstacles. Real-world obstacles may not be completely opaque; for example, if these dynamic obstacles were other humans, they would not present the same occlusion silhouette. In some ways, recovering the partial information available around an obstacle with a more complex occlusion silhouette may be more challenging to the system. Although extra information is available, the difficulty in extracting it may lead to false or misleading interpretations. This would be a significant challenge for the vision method, but it would likely obscure the effect of system reconfiguration. Thus, while a partially transparent obstacle might present additional false-positive feature points, the complete loss of visual information caused by the two opaque obstacles presents a more challenging problem for the reconfiguration process. Significant camera movement is necessary for the subject to be visible at all, making this scenario more beneficial for characterizing the effect of the active-vision system alone. For the purposes of the experiment, the entire area bounding the two other subjects is modeled as opaque, representing this worst-case scenario. Results are presented in Figure 5.11.

FIGURE 5.11 EXPERIMENTAL RESULTS FOR DYNAMIC OBSTACLES BASELINE TRIAL

System performance for trials with static cameras, as expected, is poor; the OoI form is not recognized at any instant. Simulated performance is pessimistic due to the simplified graphical representation of the subject and the greatly increased number of partial occlusions. However, the experimental performance more accurately reflects the true tendency of the results. For both trials where reconfiguration is used (slow or fast), the OoI is recognized at all instants, showing a tangible improvement. For the slow reconfiguration experimental trial, an increasing trend can be seen in the error metric. This is a result of the disparity between the system's reconfiguration ability and the OoI's velocity. While the system selects the best achievable poses at each demand instant, the OoI begins to outpace the system starting at Demand Instant 5. When additional reconfiguration capability is available, the system does not exhibit the same trend. For all of the above trials, the experimental results were highly repeatable when multiple trials were conducted, owing to the use of the human analogue to provide repeatable data. Taken together, these trials show that in all cases, the use of sensing-system reconfiguration has tangibly improved sensing-task performance. Furthermore, all results closely correspond to the predictions of the simulated experiments, and are highly repeatable.

Multi-Action Experimental Results

Given the results for Experiment 1, it has been shown that the proposed methodology can indeed improve action-recognition performance when applied to a single-action OoI environment. This effect is consistent at multiple levels of system ability and obstacle intrusiveness. Thus, the results for Experiment 2 are presented herein to verify that these results are consistent for multi-action OoI scenarios. In particular, results for experiments on human identification using gait recognition are presented. These results are summarized in the following table, Table 5.4.

TABLE 5.4 RESULTS FOR HUMAN GAIT-BASED IDENTIFICATION MULTI-ACTION TRIAL

Trial Type                            True Positive Rate   False Positive Rate   False Negative Rate
Static Cameras, No Obstacles                69.3%                 4.0%                 26.7%
Static Cameras, Static Obstacles            39.3%                16.7%                 44.0%
Static Cameras, Dynamic Obstacles            5.3%                 8.0%                 86.7%
Dynamic Cameras, No Obstacles               81.0%                19.0%                  0%
Dynamic Cameras, Static Obstacles           79.7%                18.3%                  2.0%
Dynamic Cameras, Dynamic Obstacles          79.3%                17.3%                  3.4%
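For reference, the categorical rates reported in Table 5.4 can be tallied directly from per-trial outcomes using the outcome classes defined for this experiment. The short sketch below is illustrative only; the example labels and the counting helper are assumptions, not the thesis code.

```python
from collections import Counter

# Hypothetical per-trial outcomes for one camera/obstacle configuration.
# Each outcome is one of the categories defined for Experiment 2:
# "TP" (true positive), "FP" (false positive), "FN" (false negative).
outcomes = ["TP", "TP", "FN", "TP", "FP", "TP", "FN", "TP"]

def categorical_rates(outcomes):
    """Return the true-positive, false-positive, and false-negative rates."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {label: counts.get(label, 0) / total for label in ("TP", "FP", "FN")}

print(categorical_rates(outcomes))  # e.g. {'TP': 0.625, 'FP': 0.125, 'FN': 0.25}
```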

One can note several important trends in this data. First, the use of dynamic cameras in a given trial significantly reduces the average error in the recovered feature vector, as predicted by the simulated results in Chapter 4 and the real-world results from Experiment 1. The difference in the average error metric value between trials with static and dynamic cameras ranges from a 53% to a 68% reduction. In fact, with static cameras and dynamic obstacles, the average recovered error is above an upper limit defined for positive recognition, 0.125, meaning that the system typically cannot recover a form estimate at all. The increasing error with static cameras and intrusive obstacles is mirrored by a falling true-positive (success) rate, the chosen metric of performance. Performance is especially poor with static cameras and dynamic obstacles, as expected, with only 5.3% of subjects recognized. Note that the system is still able to reject many false positives, with the false-positive rate peaking at 16.7% for static obstacles. These results mirror the past characterization of static-camera performance; the task at hand is particularly demanding (similar to the dynamic-obstacle trial in Experiment 1).

The use of active vision tangibly improves performance. For all trials, the system rejects only a small portion of subjects (3.4% maximum with dynamic obstacles). Performance is nearly identical, regardless of the obstacles, indicating that active vision has removed their effect on the gait-recognition process. Performance plateaus around 80%, which was found to be primarily due to the dataset itself; even with perfect knowledge of the true feature vectors, system performance would not increase. The data simply do not provide sufficient discriminatory information. This situation could be improved by tracking additional feature points and using a more advanced human model. However, one must keep in mind the curse of dimensionality, where diminishing returns from adding additional feature points may worsen identification performance under certain discriminators. This must also be weighed against the fact that the experiment still uses relatively few registered subjects; performance would likely plateau at a lower value with additional users. However, the fact remains that there is a clear difference when using active vision; even with additional users, active-vision theory and these results strongly suggest that performance could be improved over static cameras. Thus, the principal goal of this experiment is met; the system's performance in human gait-based identification, a multi-action sensing task, has clearly been improved by using active vision.

Multi-Level Action Experimental Results

A final experiment is presented, which uses the proposed strategy to sense two simultaneous actions. The goal is to demonstrate that the quality of viewpoints produced by the reconfigurable system is sufficient to allow generation of derived data. This is a typical requirement for many real-world tasks, wherein a secondary measurement is derived from the result of the primary task, form and action recognition. The secondary measurement may take the form of a separate hand gesture or facial

expression, for example. In this application, the goal is to recognize the specific point in the environment that a person is pointing to, using only the recognized form data that is available after detecting a pointing motion. However, it is first necessary to verify that the system can identify multiple simultaneous actions at similar levels of detail.

Trial 1: Multiple Simultaneous Actions

As outlined in the experimental description, a total of 20 trials were performed using a random selection of pairs of motions. The results are summarized below in Table 5.5. As in the previous section, positive-classification results are categorized as either true-positive or false-positive matches. Classification is forced, so there are only true-positive and false-positive results.

TABLE 5.5 RESULTS FOR RANDOM, SIMULTANEOUS ACTION TRIAL

Trial Type                            True Positive Rate   False Positive Rate
Static Cameras, No Obstacles                  75%                  25%
Static Cameras, Static Obstacles              45%                  55%
Static Cameras, Dynamic Obstacles              5%                  95%
Dynamic Cameras, No Obstacles                 85%                  15%
Dynamic Cameras, Static Obstacles             85%                  15%
Dynamic Cameras, Dynamic Obstacles            75%                  25%

These results exhibit trends similar to those observed in the past two trials. As expected, performance using static cameras is universally poor. In fact, for the case of dynamic obstacles, performance is below what one would expect if random classification were used. However, with less intrusive obstacles, static cameras do perform slightly better than in the previous trial. This is mainly because the combinatorial method for library action coding has increased the effective Euclidean distance between library actions. The difference is not significant, however, and would likely disappear if complete multiple-action coding or a larger action library were used. As with the previous experiment, the use of constrained reconfiguration tangibly improves the true-positive rate. System performance plateaus around 80%, which, as identified previously, is mainly an artifact of the Euclidean distance metric used for action classification. Thus, a tangible performance improvement over static cameras is confirmed for a basic multiple-simultaneous-action sensing task. The final trial will characterize performance when two simultaneous actions occur at differing levels of detail.
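The classification step underlying Tables 5.4 and 5.5 is, at its core, a nearest-library-entry decision. The sketch below is a minimal, hypothetical illustration of such a decision for a combinatorially coded library; the entries, the feature-vector layout, the distance normalization, and the rejection threshold are assumptions for illustration, not the thesis implementation. Each stored entry is a time-normalized sequence of key-frame feature vectors, a recovered sequence is compared to every entry by Euclidean distance, and a match is rejected if the best distance exceeds an upper limit.

```python
import numpy as np

def classify_action(recovered, library, reject_threshold=0.125):
    """Nearest-neighbour classification of a recovered key-frame sequence.

    recovered: (n_frames, n_features) array of time-normalized feature vectors.
    library:   dict mapping an action label (possibly a combined label such as
               ("walk", "point")) to an equally sized reference array.
    Returns (label, distance), or (None, distance) if no entry is close enough.
    """
    best_label, best_dist = None, np.inf
    for label, reference in library.items():
        # Frobenius distance over the whole sequence, normalized by element count.
        dist = np.linalg.norm(recovered - reference) / recovered.size
        if dist < best_dist:
            best_label, best_dist = label, dist
    if best_dist > reject_threshold:
        return None, best_dist          # not confident enough: a negative result
    return best_label, best_dist

# Hypothetical usage with a tiny two-entry combinatorial library.
rng = np.random.default_rng(0)
walk_point = rng.normal(size=(10, 42))           # stored ("walk", "point") entry
walk_wave = rng.normal(size=(10, 42))            # stored ("walk", "wave") entry
library = {("walk", "point"): walk_point, ("walk", "wave"): walk_wave}
observed = walk_point + 0.01 * rng.normal(size=(10, 42))
print(classify_action(observed, library))        # -> (("walk", "point"), small distance)
```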

Trial 2: Multi-Level Simultaneous Action

In this experiment, the subject maintains the same straight-line walking profile as in the first experiment, Experiment 1. However, in addition, the subject begins a pointing motion upon entering the workspace, and continues to point at the same stationary spot while moving along its motion path (Figure 5.12). System reconfiguration ability is the same as in the fast trials above.

FIGURE 5.12 SAMPLE POINTING MOTION FOR TRIAL 2

The goal is to recognize both actions concurrently, while estimating the target location of the pointing motion at each Demand Instant where such a motion is detected. A total of 25 Demand Instants were considered. This task is typical of a real-world interaction between humans which is difficult for an automated system to process. Any error from the form-recognition aspect is multiplicatively related to the error in the final point, as its location is derived purely from the recognized form. In this case, only a minimum of two key points on the subject is used to indicate the motion. The scale of this action, or the level of detail, is significantly different from that of the secondary, full-body motion (walking) being concurrently recognized. Results from this experiment are shown in Figure 5.13 as a scatterplot of the difference from the detected point to the actual goal point.

FIGURE 5.13 SCATTERPLOT OF RESULTS FOR SECONDARY PARAMETER ESTIMATION
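For context, the secondary parameter in this trial can be derived by extending a ray through the recovered arm key points and intersecting it with the environment. The sketch below is a simplified, hypothetical version of such a derivation; the choice of key points, the planar target surface, and all names are assumptions, not the thesis implementation. It also illustrates why a small error in the recovered form is multiplied by the distance to the target.

```python
import numpy as np

def pointing_target(p_origin, p_tip, plane_point, plane_normal):
    """Intersect the ray from p_origin through p_tip with a plane.

    p_origin, p_tip: recovered 3-D key points (e.g., elbow and hand).
    plane_point, plane_normal: any point on the target plane and its normal.
    Returns the 3-D intersection point, or None if the ray is parallel.
    """
    direction = p_tip - p_origin
    denom = np.dot(plane_normal, direction)
    if abs(denom) < 1e-9:
        return None
    t = np.dot(plane_normal, plane_point - p_origin) / denom
    return p_origin + t * direction

# Hypothetical example: a small key-point error grows with target distance.
elbow = np.array([0.0, 1.2, 0.0])
hand = np.array([0.3, 1.2, 0.0])                  # true pointing direction: +x
wall = (np.array([7.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]))   # plane at x = 7 m
true_hit = pointing_target(elbow, hand, *wall)
noisy_hit = pointing_target(elbow, hand + np.array([0.0, 0.005, 0.0]), *wall)
print(np.linalg.norm(noisy_hit - true_hit))       # ~0.12 m error at a 7 m range
```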

From Figure 5.13, the straight-line error has a mean of about 18 mm and a standard deviation of about 12 mm. To scale, with an average distance to the target point of about 7 m, the corresponding average recovered pointing error is around 110 mm. The walking and pointing motions were successfully detected for all frames in the trial and, hence, for brevity, no error metric plots are presented. Although it is not possible to evaluate the quality of this result without additional information about the specific application, it has been shown that the system is at least capable of performing the recognition task at hand. A similar trial was attempted using static cameras, but the base actions were not positively recognized, even when no obstacles were present. Thus, the proposed methodology has indeed tangibly improved performance in the sensing task.

5.4 Summary

This Chapter has presented a novel multi-action and multi-level sensing-system reconfiguration method designed specifically for sensing TVG objects and their actions. This method is the successor to the single-action framework found in Chapter 4, and the precursor to the customizable real-time

framework found in Chapter 3. As with the previous single-action methodology, it is also useful on its own as a simplified methodology for use in low-cost, multi-action applications. In particular, the proposed methodology is more suited to monolithic implementation than the later pipeline framework. In some situations, it may also offer superior pose-decision latency.

The proposed methodology was shown to implement changes to the baseline central-planning-based methodology which were identified in Chapter 3. The visibility metric and framework have been designed to directly consider the sub-part visibility of the OoI, allowing for improved feedback from the current action to be used in pose decisions. Attention was given to real-world issues, including feature-point detection and tracking and real-time operation of the camera system. The proposed action-library system was also extended to allow simple combinatorial multiple actions and multiple-action coding. A simplified real-world human analogue was also designed to allow for controlled, real-world experiments.

Real-world experiments were also presented which demonstrate a tangible improvement in sensing-task performance under a variety of multi-action scenarios. Initially, baseline experiments characterized performance for varying levels of system reconfiguration (static cameras, and slow/fast reconfiguration) and intrusive obstacles (no obstacles, static obstacles, and dynamic obstacles). It was found that if sufficient reconfiguration ability is available, the system can present un-occluded views to the vision system, effectively removing the effect of obstacles on the system's performance. However, it was also found that for a given task, there is a minimum level of reconfiguration capability necessary to perform the task successfully at all times. Further experiments examined these same scenarios when multiple actions (human gait identification) and multiple simultaneous actions were to be sensed. The results were consistent with the previous experiment; the system tangibly improved sensing performance over static cameras, and removed the effect of obstacles on the system. A final trial also confirmed that actions which occur at differing levels of detail can be successfully sensed.

However, some weaknesses still remain in the proposed system. For guaranteed real-time operation, a more formal framework which ensures deadlines are enforced is necessary. This will also allow for better characterization of the performance of the system, allowing formal design guidelines to be created for future system designers. The following chapter, Chapter 6, will address these and other issues by proposing a novel, customizable framework based on the pipeline architecture.

6. Real-time Human Action Recognition

In Chapter 3, a detailed methodology for time-varying geometry (TVG) object action sensing was presented. This methodology is based on past single-action and multi-action sensing methodologies, which were examined in Chapters 4 and 5. These previous methods, while still applicable to simplified sensing problems, were not designed with real-time operation in mind. The customizable framework examined in this chapter is designed specifically to address real-time action-sensing requirements for a wide variety of TVG object-sensing applications.

More specifically, this Chapter will examine details of the implementation and framework which were not elaborated on in Chapter 3. Section (6.1) will present a brief overview of the methodology, but will mainly focus on design considerations for the system designer. In particular, this section will examine how one might implement this methodology on given hardware for a particular action-sensing task. This includes considerations of the layout of pipeline hardware, real-time constraints on the implementation, and application-based customization issues. Section (6.2) will provide a formal framework for evaluating active-vision implementations. Finally, Section (6.3) will provide real-world, real-time experiments under this evaluation framework which validate the proposed methodology, compare performance to past experiments, and characterize real-time performance.

6.1 Real-time Methodology

The real-time methodology discussed in this Chapter remains the same as it was in Chapter 3. This customizable framework is the culmination of an iterative design process which includes all results from Chapters 4 and 5. The framework itself is based on a 10-stage pipeline architecture, as shown in the abstract overview in Figure 6.1 (see Chapter 3 for a larger, detailed view).

FIGURE 6.1 ABSTRACT OVERVIEW OF PIPELINE ARCHITECTURE

All generalized details of the pipeline, including the implementation of the various stages, are available in Chapter 3, and will not be repeated here.
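To make the staged organization concrete, the following sketch shows one way such a pipeline could be laid out in software. It is an illustrative skeleton only, written under the assumption that stages run concurrently and communicate through bounded queues; the stage names mirror the ten agents of the proposed pipeline, but none of this code reflects the actual thesis implementation.

```python
import queue
import threading

# Ordered stage names, one per agent of the proposed 10-stage pipeline.
STAGES = ["sensor", "synchronization", "point_tracking", "de_projection", "solver",
          "form_recovery", "action_recognition", "prediction", "central_planning",
          "referee"]

def make_stage(name, work, inbox, outbox):
    """Run one pipeline stage: pull a packet, process it, pass it downstream."""
    def run():
        while True:
            packet = inbox.get()
            if packet is None:          # shutdown sentinel
                outbox.put(None)
                return
            outbox.put(work(name, packet))
    return threading.Thread(target=run, daemon=True)

def demo_work(name, packet):
    packet.setdefault("trace", []).append(name)   # placeholder per-stage processing
    return packet

# Wire the stages together with bounded queues (back-pressure between stages).
queues = [queue.Queue(maxsize=4) for _ in range(len(STAGES) + 1)]
threads = [make_stage(n, demo_work, queues[i], queues[i + 1]) for i, n in enumerate(STAGES)]
for t in threads:
    t.start()

queues[0].put({"demand_instant": 5})    # a hypothetical demand instant enters Stage 1
queues[0].put(None)
result = queues[-1].get()
print(result["trace"])                  # visits every stage in order
```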

Real-time Constraints

In Chapter 2, multiple constraints on the operation of the system were provided. These mainly deal with the sensing-system reconfiguration problem, such as the core optimization itself. However, it is difficult to mathematically specify constraints to guarantee real-time operation from the sensing problem alone. As such, this section presents a collection of metrics and constraints, developed through observation of all real-time trials, which help ensure stable, real-time operation for the experimental implementation that follows. All constraints are designed to be as general as possible, so as to be applicable to cases other than this single experimental implementation. In general, system stability and real-time performance were found to strongly depend on the choice of vision methods, prediction methods, optimization methods, motion models, camera models, and the OoI model. As such, these constraints should be viewed as a necessary set of conditions, but not a sufficient set of conditions, for correct operation. In general, it is the responsibility of the system designer to ensure that no additional constraints are necessary for their particular application or implementation. As such, several key metrics and associated constraints are defined herein.

1. The effective average update rate of the system, f_u, must be selected based on the actions to be detected, to avoid action aliasing. In general, each action has a maximum rate of change for each degree-of-freedom (dof) in the action feature vector. Under basic sampling theory, the system must update sufficiently often to avoid missing high-frequency movements of the OoI. As such, the first condition is given by Equation (6.1):

f_u >= f_min = 2 / t_min     (6.1)

The average update rate, f_u, must be greater than a minimum, f_min, alternately defined by a minimum interval, t_min. The value of t_min is the minimum separation in time between any two key feature vectors in the action library. Under basic sampling theory, the update rate must be greater than 2/t_min to avoid aliasing. However, since this is based on an average update rate, a more conservative choice of 4/t_min to 8/t_min is recommended.

2. The goal, or desired, average update rate, f_g, is further constrained, in order to prevent poor reconfiguration performance. Although the system designer could select a low update rate to allow the system to always examine all possible pose-decision alternatives, selecting a rate that is too low will effectively reduce the reconfiguration ability of the system. In general, only a small portion of demand instants will require extensive searching during the optimization process, and the benefit from additional time spent on these instants is often small (these instants

typically have a poor set of pose alternatives to begin with). As such, a second lower limit on the system update rate is given by Equation (6.2):

f_g >= f_d     (6.2)

f_g >= μ_d - 3 σ_d     (6.3)

One may note that if the achievable pose-decision rate, f_d, in Equation (6.2) cannot be concretely defined, then the lower limit is given by Equation (6.3). Therein, a Gaussian random variable, with mean μ_d and sample standard deviation σ_d, is used, representing the rate given by the time it takes to complete one pose decision. The Gaussian parameters should be estimated over a sufficiently large number of samples. The trade-off between average search completeness and the lower limit can be controlled by the multiple of the standard deviation used.

3. The design update rate, f_des, is also constrained to be less than a maximum, f_max, as in the following equation, Equation (6.4). Overly high update rates can result in an unstable system, and also effectively shorten the length of the rolling horizon. A shortened horizon compounds the effect of poor pose decisions, as the system loses the ability to examine future instants.

f_des <= f_max     (6.4)

f_max = N_h / t_a     (6.5)

In Equation (6.5), N_h is the length (in number of demand instants) of the rolling horizon, which is typically equal to the length of the longest action in the library. Also, t_a is the average length, in time, of the longest action in the library. One may note that this constraint must be modified in cases where there is a large difference in action lengths in the action library. The optimal case is where the action library consists mostly of actions of equal length in key-frames.

4. The decision ratio, r_d, of the system must also be constrained. If the system uses fallback poses often, it may also become unstable, and will certainly perform poorly. As such, the decision ratio is limited to a minimum value in Equation (6.6):

r_d = f_dec / (f_dec + f_def) >= r_min     (6.6)

In Equation (6.6), f_dec is the frequency of demand instants with a definite sensor-pose decision, and f_def is the frequency of instants where a default pose decision is used. Typically, r_min = 0.95, which corresponds to a 95% decision ratio.

If these constraints are carefully considered for a given application, it is possible to design an implementation that can yield a success rate in excess of 95% for a given TVG action-recognition task. Note that this refers to real-time success, or the number of instants which successfully traverse the complete pipeline. At a minimum, these constraints provide a clear picture of the suitability of the implementation for the given real-time sensing task.
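As a worked illustration of Conditions 1 to 4, the sketch below checks a candidate design against Equations (6.1) to (6.6). The parameter names follow the symbols used above; the numeric values are hypothetical, and the helper itself is not part of the thesis implementation.

```python
def check_realtime_constraints(f_u, t_min, f_g, mu_d, sigma_d,
                               f_des, n_h, t_a, f_dec, f_def, r_min=0.95):
    """Evaluate the update-rate and decision-ratio constraints (6.1)-(6.6)."""
    return {
        # (6.1): effective average update rate above the aliasing limit 2 / t_min
        "aliasing": f_u >= 2.0 / t_min,
        # (6.2)/(6.3): goal rate not below the achievable pose-decision rate,
        # estimated here as mu_d - 3*sigma_d when no concrete bound is available
        "decision_rate": f_g >= mu_d - 3.0 * sigma_d,
        # (6.4)/(6.5): design rate below the maximum set by the rolling horizon
        "horizon": f_des <= n_h / t_a,
        # (6.6): decision ratio above the minimum (fallback poses must be rare)
        "decision_ratio": f_dec / (f_dec + f_def) >= r_min,
    }

# Hypothetical design point: a 10 Hz target rate, 0.3 s minimum key-frame
# separation, pose decisions completing at 12 +/- 1 Hz, and a 30-instant
# rolling horizon covering a 3.0 s longest action.
print(check_realtime_constraints(f_u=10.0, t_min=0.3, f_g=10.0, mu_d=12.0,
                                 sigma_d=1.0, f_des=10.0, n_h=30, t_a=3.0,
                                 f_dec=97, f_def=3))   # all conditions satisfied
```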

6.2 Formal System Evaluation

In Chapter 5, the issue of formal system evaluation was introduced as a potential area for novel research. Active-vision systems, including those sensing both TVG and non-TVG objects, inherently have a measurable, real-world metric of performance. However, it is not always clear how to evaluate an active-vision system based purely on performance. The stated goal of sensing-system reconfiguration is to improve performance in the sensing task by improving the quality of the input data first. This goal does not specify the magnitude of the increase necessary to constitute an improvement. Similarly, it does not specify a manner to compare two systems, even if the vision task is the same. Previous experiments, including those in Chapters 4 and 5, have used ad-hoc methods to evaluate results. However, as part of a complete framework, it is necessary to provide a formal method of evaluation. This method, for an active-vision system, will consist of two parts: (i) a set of formal, mathematical criteria which govern the success or failure of a sensing system, and (ii) a framework for measuring and evaluating the system, and controlling variables inherent in the sensing environment.

System Comparison

As discussed in Chapter 2, the core reconfiguration problem is the selection of sensor parameters (including on-line poses) which maximize sensing-task performance. Thus, a generic, real-time active-vision action-sensing system has, at its core, a global optimization problem, formulated to maximize action-recognition performance. This optimization can be implemented directly, or indirectly (as in the case of the method proposed in Chapter 3). However, performance itself can generally always be measured directly. The actual measurable quantity for performance, though, depends on the task at hand. For TVG object action sensing, a potential metric of performance is the success rate in recognizing the current OoI form and action [168].

This metric was used in experiments in Chapters 4 and 5. It is an example of a direct metric; success rate is calculated purely as a function of the experimental outcome. More specifically, it is the ratio of the number of trials where an action is recognized successfully to the total number of trials. Indirect metrics, on the other hand, relax the above requirement. For example, in other experiments in Chapters 4 and 5, the feature-vector error level is used as an indirect metric. Although the error level is assumed to be inversely proportional to sensing-task performance, it is not directly calculated as a function of experimental outcomes, or vice versa. Fundamentally, it is not possible to predict the change in sensing-task performance caused by a corresponding change in error level, if no other information is provided. It is known that performance and the indirect metric are correlated, and one can create statistical guidelines to describe this correlation, but comparison in the absence of qualifying information is not possible. The visibility metric (and corresponding optimization) that forms the core of this work is another example of an indirect metric.

For an on-line system's core optimization, an indirect metric is necessary; otherwise, a cyclical requirement will result. To calculate a direct metric for performance, the task must already be complete, precluding the reconfiguration process. Off-line reconfiguration can address this issue by sampling performance through an iterative process of experiments and off-line reconfiguration steps. However, on-line reconfiguration requires immediate feedback about performance. Hence, an indirect metric is used. In this case, it is well suited to the problem at hand; through relaxation of requirements, it can be calculated as a function of system parameters (Chapter 2). Without complete a priori knowledge, solutions to the on-line pose-selection optimization problem inherently cannot be guaranteed to be globally optimal, so the above relaxation is of little concern. However, when evaluating the usefulness of a proposed system, or when comparing the performance of two systems, the relaxation becomes an issue. Specifically, it significantly reduces the logical strength of any conclusion reached through comparison or evaluation of the metric. Thus, for system evaluation, this work will only consider direct metrics of performance. The performance objective function, Pr, is used directly in all equations that follow, and must be a direct metric of system performance. For the real-world TVG action-sensing experiments that follow, Pr will be specified as the success rate.

Given the general-case performance optimization from Chapter 2, and a direct measure of performance, it is necessary to formally specify the evaluation. The most basic requirement for a successful application of sensing-system reconfiguration to a given vision task is that vision-task performance improves. More formally, vision-task performance may be equivalent to performance for static cameras, or it may increase, but it may never decrease:

Pr_recon >= Pr_static     (6.7)

Thus, Equation (6.7) gives the first criterion for a given sensing task: if sensing-task performance using the reconfigurable sensing system, Pr_recon, is no worse than static-camera performance, Pr_static, then the system is considered to be successful. This establishes the basic requirement for a system designer to achieve with their system implementation. However, further specifications are necessary to allow comparison of multiple designs. If, for a given sensing task, two systems are being compared using the same, direct metric of performance, an equivalency of comparison can be written:

(Pr_2 > Pr_1) <=> (U_2 > U_1)     (6.8)

In Equation (6.8), Pr_1 and Pr_2 are the measured metrics of performance for systems one and two, respectively. The variables U_2 and U_1 are ordinal rankings for systems two and one, respectively, in terms of the task goal specification. Simply put, if the performance for system two is greater than that of system one, then system two is considered to be strictly more useful or successful in improving sensing-task performance. This is what allows ordinal relationships to be constructed through transitivity.

However, it is useful to extend the evaluation framework to include visibility, an indirect metric of performance. Visibility cannot provide the logical equivalency shown above, but there are still many situations where it is useful or necessary to examine visibility directly. As per the system specification in Chapters 2 and 3, it is assumed that performance, Pr, is a monotonically increasing function of the indirect metric, visibility. Visibility is, in turn, a function of direct, controllable system variables: the sensor poses. This allows one to form an implication, rather than an equivalency:

(V_2 >= V_1) and (g in G) => (Pr_2 >= Pr_1)     (6.9)

In Equation (6.9), the logical implication is based on a comparison of the visibility metrics of systems one and two, V_1 and V_2, respectively. If the visibility metrics of system two are higher than those of system one, and the function relating visibility to performance, g, belongs to the set of monotonically increasing functions relating visibility to performance, G, this implies that the performance of system two is at least that of system one. This is inherently a logically weaker evaluation than Equation (6.8), as a result of the indirect metric. The reason the inference is useful, though, is that visibility has been selected as part of the system's design to be related to sensing-task performance in a specified manner.
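To make the use of these relations concrete, the sketch below encodes Equations (6.7) to (6.9) as simple predicates. The success rates and visibility values are hypothetical, and the Boolean monotonicity flag stands in for the condition g in G; nothing here is taken from the thesis implementation.

```python
def is_successful(pr_recon, pr_static):
    """Equation (6.7): the reconfigurable system must not perform worse."""
    return pr_recon >= pr_static

def ordinal_rank(pr_1, pr_2):
    """Equation (6.8): ordinal ranking from a shared direct performance metric."""
    if pr_2 > pr_1:
        return "system 2 strictly more useful"
    if pr_2 < pr_1:
        return "system 1 strictly more useful"
    return "equivalent under this metric"

def visibility_implies_performance(v_1, v_2, g_is_monotonic):
    """Equation (6.9): the indirect (visibility) comparison supports only a
    one-way inference, and only when the visibility-to-performance mapping
    is monotonically increasing."""
    if v_2 >= v_1 and g_is_monotonic:
        return "Pr_2 >= Pr_1 may be inferred"
    return "no conclusion about performance"

# Hypothetical comparison of a static and a reconfigurable implementation.
print(is_successful(pr_recon=0.957, pr_static=0.685))     # True
print(ordinal_rank(pr_1=0.685, pr_2=0.957))               # system 2 strictly more useful
print(visibility_implies_performance(v_1=0.42, v_2=0.61, g_is_monotonic=True))
```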

Because of the logical implication, if either input condition is not satisfied, one cannot draw any conclusion about performance. Similarly, one cannot use performance to make predictions about visibility. However, in this manner visibility can be used to draw limited conclusions about performance, which can in turn be used to make more complex conclusions about relative usefulness.

Given an ordinal ranking of systems based on performance, a quantifiable measure of the improvement can also be defined. This quantifiable measure is available for any two direct metrics of performance, Pr_1 and Pr_2, with uniform representation:

ΔPr = Pr_2 - Pr_1     (6.10)

In Equation (6.10), the quantifiable measure of improvement, ΔPr, is defined simply as the difference in the values of the direct metric of performance. This measure of difference can be used to draw ordinal conclusions for multiple systems with the same sensing task and direct performance metric. However, it still cannot be used to draw conclusions between two systems which have a significant difference in the sensing task at hand, or in the way performance is measured. This is a fundamental issue when comparing two essentially different sensing applications or solutions. Together, these conditions can be used to form a necessary, but not sufficient, condition of equivalency between two systems in the general case. For specific cases where more details of the task and system are known, it may be possible to establish further relations to complete a necessary and sufficient condition set. However, all evidence suggests it is not possible to do so for the general case. Thus, the above evaluation scheme can be used to do only the following:

1. Evaluate systems with uniform sensing-task specification and direct performance metrics, including ordinal evaluation and equivalency of performance evaluations.

2. Provide strong evidence for equivalency of performance between dissimilar systems, in the form of necessary, but not sufficient, conditions.

Namely, these evaluations are the only scientific conclusions which can be drawn from experiments evaluating implementations of the proposed framework. One must be careful to avoid intuitive or qualitative evaluations, since, as shown above, they may be potentially misleading in some cases.

Evaluation Environment and TVG Analogue Design

Given the above evaluation framework, it is also necessary to conduct any evaluation experiments in an environment which controls the appropriate variables to allow for a scientific experiment. Multiple variables, specific to TVG object action sensing, were identified in Chapter 2. In Chapters 4 and 5,

191 166 the design of simulated and real-world experimental environments was examined in detail. The environments presented in these Chapters inherently divide variables into three groups: (i) purposely controlled, (ii) purposely uncontrolled, and (iii) uncontrolled. Uncontrolled variables are the simplest to classify these are any variables which simply cannot be controlled, desirable or not. Examples are typically random error sources, such as pixel location noise inherent in vision-based sensing, pose reproduction noise, and form reproduction noise. While the effect of these variables can be minimized, they cannot be fixed to a specific level. Even in simulation, some uncontrollable variables may exist. Purposely uncontrolled variables, on the other hand, are variables which can be controlled, but are purposely ignored instead. A common goal of experiments is to realistically portray real-world situations, wherein inputs to the system would naturally be uncontrolled. Effects such as lighting, background clutter, and so on, can change the output of the system. However, the proposed framework is designed to be robust to such issues. As such, they can be left uncontrolled their contribution to experimental variation should be small, and can be removed by examining multiple trials. Lastly, purposely controlled variables include those which absolutely must be controlled in order to create repeatable, scientific experiments. In general, the real-world is inherently uncontrolled, so the goal for any experiment is to limit the use of controlled variables to create the most realistic results possible. Some controlled variables are desired; for an experiment to determine the effect of obstacles on system performance, one would naturally control the number, type, and pose of all obstacles. However, controlling other variables is not necessarily desired, but is necessary. The most basic variable, which must be controlled for all experiments, is of course the TVG object itself. The TVG object (and its actions) provide the most fundamental input to the system, and the system outputs (selected poses, and in turn, performance) are shown to widely diverge for even small changes. To conduct controlled experiments, a controllable, repeatable TVG analogue is necessary. In general, TVG objects will not repeat actions perfectly over multiple trials. The variation in output caused by this prevents such results from being generalized to scientific conclusions. This necessitates control over the OoI s form. It should be noted, however, that as part of a complete evaluation, one must still examine the effect of action variation from real TVG objects. TVG Analogue Design To control the OoI form in real-world experiments, a user-controllable TVG analogue is necessary in most situations. For example, Chapter 5 presented a simplified human analogue which directly implemented the 14-dof model used by the proposed methodology. While the experiments evaluating

this methodology were successful for this specific case, the customizable framework requires a general-purpose procedure to create a suitable TVG object analogue. Furthermore, as part of this procedure, it must be shown that it is valid to generalize any evaluation of a system sensing this analogue to cases where the system would sense the real object. To achieve these goals, the design of the human analogue must accurately replicate the input data a real-world human would present to the reconfigurable sensing system. The premise is that if a system is presented with identical input data, it should make identical decisions. As such, a formal design methodology is presented herein to ensure a rigorous theoretical basis for all design choices when creating the TVG object analogue.

The method described below can be applied to any TVG object which suits the criteria for an articulated object established in Chapter 2. Specifically, this method assumes that the object at hand is best represented by a collection of rigid sub-parts, which connect and articulate at specific points, or joints. Herein, it is also assumed that the intended subset of object actions that one wishes to recognize is a priori known. Furthermore, it is assumed that a set of object input data is available in the form of 3D world-coordinate feature points taken from multiple subjects (where applicable, such as for humans), where the OoI performed all library actions over multiple trials. This is, in the following experiments, the set of motion-capture data which is taken from real human subjects. As such, the process that follows will design a model and analogue which is inherently designed to capture the particular set of features unique to the task at hand. The choice of which particular feature points to use is deliberately left unspecified, as it is highly dependent on the choice of vision algorithms (i.e., which feature points can be uniquely identified and tracked), and on the specific set of actions.

To begin, this method will examine the feature points which comprise the feature vectors describing the object's form and actions. A feature point is given by:

p_(a,f,k,s,r) = [x, y, z]^T     (6.11)

Equation (6.11) gives the feature point measured for repetition r by subject s of key-frame k of feature f of library action a. If multiple sample objects are not available (or all objects of the same class are essentially identical in their actions), then one would fix s = 1. It is assumed that all world translation, rotation, and scaling effects have already been removed from the feature points, and that the features are aligned in space. Similarly, it is assumed that the data has been suitably time-normalized using either definite start and end points, or a definite periodicity. The same set of feature points represents all actions, although only a subset may have significant

variability over an action. The average of each point over repetitions, m_(a,f,k,s), is taken to remove inter-trial noise:

m_(a,f,k,s) = (1 / N_r) Σ_(r=1..N_r) p_(a,f,k,s,r)     (6.12)

It is assumed herein that the underlying model will remain constant, and that the resultant TVG object analogue should represent an average subject for the given object class. As such, a generalized connection graph, G = (V, E), is formed, with one central reference point (f = 0) as the only source. For humans, for example, this source point is typically the center of the head or torso, depending on the action subset chosen. The set of vertices, V = {0, 1, ..., N_f - 1}, corresponds to the selected feature points, and the connection set, E = {(i, j), ...}, represents all rigid physical connections (i, j) between feature points. No cycles must exist in the graph. For each connection (i, j), where j >= 1, compute the straight-line Euclidean distance for all subjects, frames, and actions using Equations (6.13) and (6.14) below:

d_(a,k,s)(i, j) = || m_(a,i,k,s) - m_(a,j,k,s) ||     (6.13)

L_(i,j) = (1 / (N_a N_k N_s)) Σ_a Σ_k Σ_s d_(a,k,s)(i, j)     (6.14)

For the final model, the lengths of all rigid segments (one per connection, (i, j)) are given in Equation (6.14) as the average connection length, taken over subjects, frames, and actions. Similarly, for each connected pair of feature points, one must complete the conversion to polar coordinates, and average over all subjects. Taking the x component of the relative vector (m_(a,j,k,s) - m_(a,i,k,s)) as Δx, the y component as Δy, and the z component as Δz, the following set of equations finds the inter-action variance associated with each joint in the model graph:

φ_(a,k,s) = arccos( Δz / d_(a,k,s)(i, j) )     (6.15)

θ_(a,k,s) = arctan2( Δy, Δx )     (6.16)

φ_0 = (1 / (N_a N_k N_s)) Σ_a Σ_k Σ_s φ_(a,k,s)     (6.17)

θ_0 = (1 / (N_a N_k N_s)) Σ_a Σ_k Σ_s θ_(a,k,s)     (6.18)

σ_φ^2 = (1 / (N_a N_k N_s)) Σ_a Σ_k Σ_s ( φ_(a,k,s) - φ_0 )^2     (6.19)

σ_θ^2 = (1 / (N_a N_k N_s)) Σ_a Σ_k Σ_s ( θ_(a,k,s) - θ_0 )^2     (6.20)

Above, arctan2 is a quadrant-aware arctangent function, and Equations (6.17) and (6.18) represent the mean center direction over all key-frames and actions for a particular joint. The number of dof for the joint, n_dof, is selected based on the exhibited angular variance:

n_dof = 2 if σ_φ^2 >= ε_φ and σ_θ^2 >= ε_θ; 1 if only one condition holds; 0 otherwise     (6.21)

Two dof are used if σ_φ^2 >= ε_φ and σ_θ^2 >= ε_θ, where ε_φ and ε_θ are minimum levels of angular variance (not specified). One dof is used if only one corresponding condition is true. If neither condition is passed, the connection is deleted (all deeper-level connections are merged upwards). Under this design scheme, a minimum number of dof can be used by the TVG object analogue to represent the greatest amount of pose variance. In particular, the analogue is designed with the feature-vector representation of the sensing system in mind.
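A compact version of the above design procedure is sketched below, assuming the feature points have already been averaged over repetitions as in Equation (6.12). The variable names mirror Equations (6.13) to (6.21); the variance thresholds, the data layout, and the example data are assumptions for illustration, not the thesis code.

```python
import numpy as np

def design_joint(rel_vectors, eps_phi=1e-3, eps_theta=1e-3):
    """Design one connection of the analogue from its relative vectors.

    rel_vectors: (n_samples, 3) array of child-minus-parent feature-point
                 offsets, pooled over actions, key-frames, and subjects.
    Returns (segment_length, n_dof) following Equations (6.13)-(6.21).
    """
    d = np.linalg.norm(rel_vectors, axis=1)                    # (6.13) per-sample length
    length = d.mean()                                          # (6.14) rigid-segment length
    phi = np.arccos(rel_vectors[:, 2] / d)                     # (6.15) polar angle
    theta = np.arctan2(rel_vectors[:, 1], rel_vectors[:, 0])   # (6.16) azimuth
    var_phi, var_theta = phi.var(), theta.var()                # (6.19), (6.20) about the means
    n_dof = int(var_phi >= eps_phi) + int(var_theta >= eps_theta)   # (6.21)
    return length, n_dof                                       # n_dof == 0 -> delete connection

# Hypothetical connection: a 0.30 m segment swinging mostly in one plane.
rng = np.random.default_rng(1)
swing = rng.uniform(-0.8, 0.8, size=200)                       # azimuthal articulation
samples = 0.30 * np.column_stack([np.cos(swing), np.sin(swing),
                                  0.02 * rng.normal(size=200)])
print(design_joint(samples))                                   # -> (approx. 0.30, 1)
```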

Experimental Verification of Design Process

As an example of the above process, the implementation of the human analogue used in the experiments in Section (6.3) is considered. The original articulated model uses 14 tracked feature points with 12 connections to sense four simple human actions (walking, waving, pointing, kicking). The design process found five static points, and removed two connections, for a total of 12 dof. The final model created by this process is shown below in Figure 6.2.

FIGURE 6.2 HUMAN ANALOGUE DESIGNED FOR FUTURE REAL-WORLD EXPERIMENTS

For the process described in the previous section, motion-capture data from real human subjects was used. The above figure, Figure 6.2, shows the previous simplified analogue, the system skeletal model, and a real human for comparison. The chosen joint positions, joint travel, and the rigid segments all closely correspond to the model used by the system, which (via the design process) is representative of the data to be sensed. The appearance of the analogue is different, but the system is not concerned with surface models, only with the exact input to the system: detected feature-point locations.

A series of simple experiments were designed to demonstrate that the poses selected by the system observing the human analogue are essentially identical to those selected when observing a real human. The human analogue described above was used in all experiments, with all dimensions and measurements scaled appropriately from the corresponding human subject. A total of four motions were tested, and the results for a walking motion are examined in detail. For each action, 70 frames were considered (approximately 7 seconds of real time), with 5 repetitions for each trial. The physical system sensing the human analogue is identical to the system that will be used for the experiments that follow in Section (6.3). As such, it will be explained in detail in the subsequent section, as the details are not relevant to this experiment. Similarly, the sensing environment is the same real-world environment used for all experiments in Chapter 5. Four static obstacles are present, each located 100 mm towards the workspace center from the camera starting positions. The background and foreground were cluttered with other, smaller (un-modeled) obstacles. An overview of the experiment is shown below in Figure 6.3.

FIGURE 6.3 HUMAN ANALOGUE VERIFICATION EXPERIMENT OVERVIEW

Each action was performed repeatedly by both the human analogue and an actual human, and was sensed by both a static system and the active-vision system. Although all actions were tested, it is beneficial to examine one set of trials in detail, as all trials were consistent in their findings. As such, let us consider the results for the walking motion, as shown in Figure 6.4. Units are the average per-element feature-vector percent difference from the recovered form to the actual subject form.

FIGURE 6.4 COMPARISON OF ERROR METRIC FOR HUMAN ANALOGUE AND REAL HUMAN
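For clarity, the error metric plotted in Figure 6.4 can be computed as in the brief sketch below; the array shapes, the small stabilizing constant, and the use of the true (analogue-commanded) form as ground truth are assumptions made for illustration.

```python
import numpy as np

def per_element_percent_error(recovered, actual, eps=1e-9):
    """Average per-element percent difference between two feature vectors.

    recovered, actual: (n_features,) arrays holding the recovered and the
    true subject form for one demand instant.
    """
    return 100.0 * np.mean(np.abs(recovered - actual) / (np.abs(actual) + eps))

# Hypothetical instant: a recovered form within a few percent of the true form.
actual = np.array([0.42, 1.35, 0.88, 0.30])
recovered = np.array([0.43, 1.32, 0.90, 0.29])
print(round(per_element_percent_error(recovered, actual), 2))   # ~2.55
```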

The baseline results for the human subject show that the average per-element feature-vector error is significantly decreased when active vision is used, as expected. Indeed, many frames sensed by static cameras exhibit a recovered form with error above the user-selected upper limit for recognition success. These results match all previous results in Chapters 4 and 5. Formally, the active-vision system is considered successful in sensing the actual human, as the criterion of Equation (6.7) is satisfied. Similarly, when sensing the human analogue, performance is also tangibly improved, with the same criterion satisfied. In addition, the measured improvement is nearly the same in both cases, indicating that for this particular set of circumstances and implementation, the magnitude of the performance increase is also accurately reflected. In general, this second relation is not guaranteed, as it is tied to other non-repeatable, uncontrolled variables. As discussed in Section (6.2.1), this is caused by the choice of performance metric, feature-vector error, which is an indirect metric. It was chosen for this experiment because it provides an intuitively satisfying result, and can be evaluated in detail at multiple instants in the trial. However, the strength of the conclusions it can support is weak. In this sense, one can examine a direct metric, success rate.

In this experiment, success rate is defined as the ratio of the number of instants where the action was positively recognized to the total number of instants sensed. The active-vision system, sensing either the human analogue or the real human, exhibited a 95.7% success rate. Similarly, the static system achieved a 68.5% success rate in both cases. Under Equations (6.7) to (6.10), the evaluation of these results is identical, providing stronger evidence that both evaluations are equivalent.

It is also useful to examine the direct output of the system, as the time-sequence of selected poses is what determines the final performance in the sensing task. Similarities or differences can provide useful information about the effect of the human analogue on the system's evaluation. To formalize this comparison, the following conditions for equivalency are imposed:

|| p_hum(i) - p_ana(i) || < ε_max,  for all i = 1, ..., N     (6.22)

(1 / N) Σ_(i=1..N) || p_hum(i) - p_ana(i) || < ε_mean     (6.23)

If, for all demand instants i = 1, ..., N in the trials, the straight-line distance between the poses selected when sensing a real human, p_hum(i), and when sensing the analogue, p_ana(i), is less than a threshold, ε_max, the first condition is satisfied. The second condition defines a similar upper limit, ε_mean, on the mean pose difference. Given these additional necessary conditions, the system outputs (four rotational angles and two translational displacements) are shown below in Figure 6.5.

FIGURE 6.5 SENSOR POSE COMPARISON FOR HUMAN ANALOGUE AND REAL HUMAN
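The two equivalency conditions can be checked directly from the logged pose sequences, as in the sketch below; the pose representation (one vector per demand instant) and the example values are assumptions for illustration, not the thesis code.

```python
import numpy as np

def poses_equivalent(poses_human, poses_analogue, eps_max, eps_mean):
    """Check Equations (6.22) and (6.23) on two aligned pose sequences.

    poses_human, poses_analogue: (n_instants, n_axes) arrays of selected poses.
    eps_max:  per-instant upper limit on the pose difference, Equation (6.22).
    eps_mean: upper limit on the mean pose difference, Equation (6.23).
    """
    diffs = np.linalg.norm(poses_human - poses_analogue, axis=1)
    return bool(np.all(diffs < eps_max) and diffs.mean() < eps_mean)

# Hypothetical check on the translational components only (mm).
human = np.array([[100.0, 250.0], [140.0, 260.0], [180.0, 255.0]])
analogue = np.array([[104.0, 248.0], [139.0, 265.0], [176.0, 251.0]])
print(poses_equivalent(human, analogue, eps_max=125.0, eps_mean=18.0))   # True
```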

Overall, one can note from Figure 6.5 that the system selected nearly identical poses when sensing either the human or the human analogue. This is verified under the conditions in Equations (6.22) and (6.23), which are satisfied with limits of 125 and 18, respectively, for the positional components, and 0.383 for the rotational components. Less formally, it is also evident that the selected poses are highly correlated (essentially identical for most instants). As such, one can conclude that the analogue has no significant effect on the sensor-pose choices of the active-vision system for this action. Similarly, all necessary conditions thus far are satisfied; there is strong evidence that system evaluation will be equivalent when sensing either the analogue or a real human. However, it is not

200 175 possible to disprove the existence of an error case, so experiments with real humans will be included in Section (6.3). Such experiments cannot be generalized as the analogue-based experiments can, but they are still useful in this context. 6.3 Real-world, Real-time Experiments As with the previous single-action and multi-action sensing methodologies, it is necessary to test the proposed real-time customizable framework through rigorous scientific experiments. In Section (6.2), a formal method of evaluation for active-vision systems sensing TVG object actions was presented. This method forms the basis of the initial experiments in this section, which are designed to validate the basic assumptions about sensing-task performance made in developing the proposed methodology. Further experiments are necessary to characterize the framework s behavior in a variety of situations. More specifically, the experiments presented in this section are designed to (i) verify that past conclusions about the effect of system ability, obstacles, and other effects remain the same for this real-time methodology, (ii) compare an example implementation of the proposed framework to past simulated and real-world, quasi-static implementations, and (iii) characterize the real-time performance of the proposed framework, as real-time operation is the core focus of this method. To achieve these goals, a total of three experiments will be presented in this section. The first is designed to achieve Goals (i) and (ii) above. In this experiment, the two-factorial trials from Chapters 4 and 5 (which compare varying levels of system reconfiguration ability and varying levels of object intrusiveness) are recreated using the new implementation. The results from these trials are compared to past simulated and quasi-static results. However, as identified in Section (6.2), this experiment alone may not be sufficient to achieve Goal (i). As such, a second experiment examines multi-action recognition performance for real humans. Together, these two experiments satisfy the basic verification requirements for the proposed framework. A third experiment, consisting of a series of trials which vary system design parameters, including number of obstacles, cameras, and library size, is also proposed. This set of trials is specifically designed to characterize the real-time performance of the system and to determine the limits and boundary-case performance of a sample implementation. The results are later generalized, to allow the system designer to make intelligent predictions about future system performance during the design phase Experimental Implementation Before examining the details of the experimental trials to follow in Section (6.3.3) to Section (6.3.5), it is necessary to examine the details of the experimental implementation of the customizable

framework proposed in Chapter 3 and Section (6.1) above. From the proposed general framework outlined in Chapter 3, a sample implementation was designed specifically for sensing human actions, in keeping with past simulated and quasi-static experiments. The customizable design procedure resulted in a modified 10-stage pipeline design which makes best use of the available hardware.

Pipeline Hardware

The processing hardware available to the pipeline system consists of two identical PCs, which have Intel Core 2 Duo processors clocked at 3 GHz (approximately 24.0 GFLOPS each of theoretical computational power, as reported by Intel). Motion-control processing is handled separately by the Parker 6K8 Ethernet-based controller. One of these two identical PCs is connected directly to all sensors in the physical sensor system and implements the initial stages of the proposed pipeline. This particular PC also contains an ATI Radeon HD2400 GPU, which is used to implement some pipeline tasks through GPU-based processing. The division of computing tasks will be examined at the same time as the relevant pipeline stage. All processing elements (the two computers and the motion controller) are linked via Ethernet (100BASE-T).

Pipeline Stages and Implementation Details

This section will begin at the head of the pipeline, Stage L1. Pipeline control is decentralized, and the system clock is shared via Ethernet between all processing elements.

Stage L1

This stage implements four virtual imaging agents, one per physical camera. The agents are implemented entirely in OpenGL/GLSL (OpenGL Shading Language) to make use of GPU hardware (ATI Radeon HD2400 Pro). GPU-based fragment shaders are ideally suited to image filtering, as they make use of the highly parallel rendering hardware. For the image-correction stage, the agent implements two filters. First, a separable Gaussian LPF [169] is used to reduce random image quantization noise. It is assumed that the raw image is corrupted by approximately Gaussian noise of measured standard deviation (8-bit color channel depth, intensity range 0-255). The overall system is relatively robust to low-level image noise, so a simple LPF was found to be sufficient. Other methods from Appendix A could be used if the image quality is poorer, or if the tracking algorithms used are less robust. The identical X- and Y-direction Gaussian kernels are calculated using the method found in Appendix B, yielding the truncated kernel of Equation (6.24) for the measured noise level.

The complete blur is applied through two successive convolutions, one each for the X and Y directions. Distortion-correction filtering is implemented based on the chosen lens-distortion model, the Plumb-Bob model [140]. The process used is based on [141], as described in Chapter 5. For each pixel in the raw input image, the intrinsic camera calibration matrix, K, is used to convert its pixel coordinates, (u, v), to normalized, distorted camera coordinates, (x_d, y_d):

[x_d, y_d, 1]^T = K^(−1)·[u, v, 1]^T,  K = [fc_1, α·fc_1, cc_x; 0, fc_2, cc_y; 0, 0, 1]   (6.25)

As originally outlined in Chapter 3, in Equation (6.25), fc_1 and fc_2 are the focal distances in pixels, cc_x and cc_y are the principal-point coordinates, and α is the pixel shear angle. These are determined off-line for each physical sensor by the complete system-calibration method proposed in Chapter 4. After this conversion, the lens-distortion model is reversed to yield normalized camera coordinates, (x_n, y_n), by solving the model equations from Chapter 3, repeated for convenience as Equation (6.26) to Equation (6.29):

x_n = (x_d − d_x) / (1 + kc_1·r^2 + kc_2·r^4 + kc_5·r^6)   (6.26)
y_n = (y_d − d_y) / (1 + kc_1·r^2 + kc_2·r^4 + kc_5·r^6)   (6.27)
d_x = 2·kc_3·x_n·y_n + kc_4·(r^2 + 2·x_n^2),  d_y = kc_3·(r^2 + 2·y_n^2) + 2·kc_4·x_n·y_n   (6.28)
r^2 = x_n^2 + y_n^2   (6.29)

For Equation (6.26), the undistorted camera coordinates of any initial pixel are (x_n, y_n). The model depends on up to five lens-distortion coefficients, kc_1 to kc_5, which are determined through the novel combined system-calibration method presented in Chapter 4. This distortion model was discussed in detail in Chapter 5. For the chosen cameras (Logitech QuickCam Pro 9000, as in previous experiments), the implementation assumes no pixel shear (α = 0) and fourth-order distortion (kc_5 = 0). After calibration, the average non-outlier re-projected point has less than 0.1 pixel-length error (i.e., when the true 3-D position of a calibration point is projected through the motion and camera model, the straight-line distance from the projected point to the measured point is less than 0.1 pixels, on average). The final set of normalized camera coordinates no longer forms a square image; missing points are therefore re-constructed through automatic bilinear interpolation, which is readily provided in the OpenGL texture pipeline. The resultant image is cropped and shifted to restore its dimensions and origin.
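A minimal CPU-side sketch of this correction is given below, assuming the Bouguet-style parameterization used above and a simple fixed-point iteration to reverse the distortion; in the thesis the same computation is performed per-fragment in GLSL, and the numeric calibration values shown here are placeholders only.

```python
import numpy as np

def pixel_to_normalized(u, v, fc, cc, alpha=0.0):
    """Invert the intrinsic model (Eq. 6.25): pixel -> normalized, distorted coords."""
    y_d = (v - cc[1]) / fc[1]
    x_d = (u - cc[0]) / fc[0] - alpha * y_d
    return x_d, y_d

def undistort(x_d, y_d, kc, iterations=5):
    """Reverse the Plumb-Bob model (Eqs. 6.26-6.29) by fixed-point iteration."""
    x_n, y_n = x_d, y_d                     # initial guess: ignore distortion
    for _ in range(iterations):
        r2 = x_n * x_n + y_n * y_n
        radial = 1.0 + kc[0] * r2 + kc[1] * r2 ** 2 + kc[4] * r2 ** 3
        dx = 2.0 * kc[2] * x_n * y_n + kc[3] * (r2 + 2.0 * x_n * x_n)
        dy = kc[2] * (r2 + 2.0 * y_n * y_n) + 2.0 * kc[3] * x_n * y_n
        x_n = (x_d - dx) / radial
        y_n = (y_d - dy) / radial
    return x_n, y_n

# Example with fourth-order radial distortion only (kc[4] = 0), as assumed for these cameras.
kc = np.array([-0.21, 0.05, 0.0, 0.0, 0.0])      # placeholder coefficients
x_d, y_d = pixel_to_normalized(400.0, 300.0, fc=(820.0, 818.0), cc=(320.0, 240.0))
x_n, y_n = undistort(x_d, y_d, kc)
```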

This agent also implements region-of-interest filtering to reduce the data-transfer requirements. For the chosen implementation, this consists of three additional filters operating in simultaneous passes on the un-distorted image. For all filters, the pixel-level calculation of interest, I(u, v), is implemented in shader hardware.

An Active Tracking Area filter forms an elliptical area of interest, centered at the predicted 2-D pixel location, (u_c, v_c), of the tracked feature. This location is determined by projecting the predicted 3-D world-coordinate position through the motion model and camera model. The interest function, I_ATA(u, v), for pixel (u, v) is given by Equation (6.30):

I_ATA(u, v) = 1 − (1 − I_min)·min(1, (u − u_c)^2/a^2 + (v − v_c)^2/b^2)   (6.30)

Equation (6.30) assumes that, in world coordinates, any prediction of target position, p̂, consists of the true position, p, plus an additive zero-mean 3-D Gaussian noise component, ε. This is given by Equation (6.31) below, with the noise covariance estimated from the residuals as in Equation (6.32):

p̂ = p + ε,  ε ~ N(0, Σ)   (6.31)

Σ ≈ (1/(N − 1))·Σ_i r_i·r_i^T,  r_i = p̂_i − p_meas,i   (6.32)

By assuming zero error in the measured world coordinates (i.e., p = p_meas), one can characterize the Gaussian noise from the residuals, as in Equation (6.32). While one cannot use this assumption to simply recover the exact error in each point (since the residual is most likely caused by measurement error in the measured coordinates), it can be used to estimate the covariance, Σ, of ε. Under these assumptions, one can define a 3σ ellipsoid around p̂, within which the true position should have an approximately 99.87% chance to lie. Projecting this shape into normalized camera coordinates yields an ellipse with center (u_c, v_c) and principal points (u_1, v_1) and (u_2, v_2). The normalized major and minor axes, a and b, then follow as the distances from the projected center to the two projected principal points, Equations (6.33) to (6.36).
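A minimal sketch of the Active Tracking Area interest field is given below; it assumes the falloff form shown in Equation (6.30) and takes the projected ellipse centre and semi-axes as inputs, with all function names and numeric values illustrative rather than taken from the thesis implementation.

```python
import numpy as np

def tracking_area_interest(shape, center, a, b, i_min=0.1):
    """Per-pixel interest for the Active Tracking Area filter.

    Interest is 1.0 at the predicted feature location and falls to i_min on
    (and outside) the projected 3-sigma ellipse with semi-axes a, b in pixels.
    """
    v, u = np.indices(shape, dtype=np.float64)        # row and column index grids
    d2 = ((u - center[0]) / a) ** 2 + ((v - center[1]) / b) ** 2
    return 1.0 - (1.0 - i_min) * np.minimum(1.0, d2)

# Example: a 480x640 interest field centred on a predicted feature at pixel (u=330, v=255).
interest = tracking_area_interest((480, 640), center=(330.0, 255.0), a=40.0, b=25.0)
```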

When used with Equation (6.30), these axes yield an interest of I_min at any point on the edge of the ellipse. This minimum level, I_min, can be selected to allow bleeding of interest to accommodate the inherent underestimation due to the above assumptions. It was determined that I_min = 0.1 for the experimental implementation. Thus, this filter tends to focus interest around the most likely area for the feature point to appear in the current image.

The Gradient-Based Edge Detection filter favors edges, which are typically caused in 2-D by a depth transition in 3-D space. This implementation assumes that the average static-geometry obstacle will contain comparatively few high rate-of-change edges when compared to a time-varying articulated object. This interest filter, I_GE(u, v), is implemented as a high-pass filter (HPF) edge-detector convolution, followed by bleeding through successive box blurs. The high-pass filter uses the standard separable Sobel edge-detection kernels [170] given in Equations (6.37) and (6.38):

G_x = [−1, 0, +1; −2, 0, +2; −1, 0, +1] = [1, 2, 1]^T·[−1, 0, +1]   (6.37)
G_y = [−1, −2, −1; 0, 0, 0; +1, +2, +1] = [−1, 0, +1]^T·[1, 2, 1]   (6.38)

One can note that, since the vertical and horizontal kernels are separable, a shader bottleneck in the OpenGL pipeline is prevented. The box blur uses a 5x5 kernel with a single constant element (1/25). Again, this filter is easily separable. A total of 5 box-filter passes are used to blur the interest field. This technique forms relatively contiguous areas of interest around distinct objects in the scene, although there is a tradeoff between filling large continuous areas and bleeding of interest outside the object boundary.

The final filter, based on a Predictive Region-of-Interest (RoI), simply uses the pixel-motion constraints generated for Optical Flow to estimate the motion of individual pixels. Given an estimate of velocity, (u̇, v̇), for pixel (u, v), and the interest level in the previous image at pixel (u − u̇·Δt, v − v̇·Δt), the current pixel interest for this filter is calculated using Equation (6.39):

I_PR,t(u, v) = γ·I_(t−1)(u − u̇·Δt, v − v̇·Δt)   (6.39)
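The sketch below mirrors the gradient-based edge-interest computation on the CPU using the kernels of Equations (6.37) and (6.38) and repeated box-blur passes; in the thesis this runs as separable passes in fragment shaders, so the code (and its normalization step) is purely illustrative.

```python
import numpy as np
from scipy.signal import convolve2d

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)  # Eq. (6.37)
SOBEL_Y = SOBEL_X.T                                                          # Eq. (6.38)
BOX_5X5 = np.full((5, 5), 1.0 / 25.0)                                        # constant-element box kernel

def edge_interest(gray, blur_passes=5):
    """Gradient-based edge interest: HPF edge detection, then bleeding by box blurs."""
    gx = convolve2d(gray, SOBEL_X, mode="same", boundary="symm")
    gy = convolve2d(gray, SOBEL_Y, mode="same", boundary="symm")
    interest = np.hypot(gx, gy)                       # edge magnitude
    interest /= interest.max() + 1e-9                 # normalize to [0, 1]
    for _ in range(blur_passes):                      # bleed interest around edges
        interest = convolve2d(interest, BOX_5X5, mode="same", boundary="symm")
    return interest
```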

The value of γ in Equation (6.39) is a scaling factor that can be used to prevent over-estimation of interest in cases where pixel motion is less reliable. If the source pixel (u − u̇·Δt, v − v̇·Δt) is not in the previous image, the interest value is clamped to one, under the assumption that new regions of the image will be of interest. After calculating per-pixel interest, the combined interest is calculated, and the image is weighted using this combined value. A lossless RLE compression algorithm [171] is implemented on the resultant image. Further details of the compression algorithm were not available, as it is an off-the-shelf library.

Stage L2
The second stage operates exactly as described in Chapter 3. The clock for the stage is delayed by 1/f, where f is the update frequency, to ensure that images are available for synchronization. Since all sensors are connected to a single computer (an Intel Core 2 Duo processor running at 2.4 GHz), Stages L1 and L2 are implemented as a shared task between the main CPU cores (although L1 mainly uses GPU time).

Stage L3
The point-tracking stage is also shared between the two cores used by Stages L1 and L2. A 4-depth superscalar implementation was chosen to maximize parallelism, given that the hardware implements 4 cores (2 real cores with Hyper-Threading, totaling 4 virtual processors). Feature points are initially detected using a modified PCA-based algorithm, which operates on a feature vector combining multiple levels of detail. During local or global search, 64x64 pixel regions are used to construct a local feature vector. The final image, P′, is given by Equation (6.40):

P′ = [P, mip_1(P), mip_2(P), …, mip_mp(P)]   (6.40)

In this equation, the input image, P, has multiple levels of mip-maps appended (the minimum mip-map level, mp, occurs when the mip-map image size is 1x1). For the PCA method and the experimental pose library, a PCA feature vector of dimension n = 56 is formed by projecting onto the library basis vectors. The feature-vector length was determined by trial and error; increasing n too much will eventually reduce Euclidean matching performance, and the goal is minimum dimensionality. However, small values of n will not provide sufficient separation between different model features. It is also important to note that PCA is not viewpoint independent; multiple views of each feature are included in the model feature library. While there are many more advanced methods, this method has been chosen for simplicity of implementation. The local search is run relatively infrequently while an action is being detected; thus, the tracking algorithm is more critical in this application.
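A compact sketch of this detection step is shown below, assuming a pre-computed library basis and mean; the helper names, the power-of-two patch handling, and the box-average mip-map construction are illustrative stand-ins for the thesis's GPU/library implementation of Equation (6.40) and the Euclidean library match.

```python
import numpy as np

def build_feature_vector(patch64, basis, mean):
    """Project a square, power-of-two patch with appended mip-maps (Eq. 6.40) onto the PCA basis."""
    levels = [patch64.astype(np.float64)]
    img = levels[0]
    while img.shape[0] > 1:                        # 32x32, 16x16, ..., down to 1x1
        img = img.reshape(img.shape[0] // 2, 2, img.shape[1] // 2, 2).mean(axis=(1, 3))
        levels.append(img)
    stacked = np.concatenate([lvl.ravel() for lvl in levels])
    return basis @ (stacked - mean)                # e.g., a 56-dimensional feature vector

def match_feature(vec, library_vectors, library_labels):
    """Euclidean nearest-neighbour match against the model feature library."""
    d = np.linalg.norm(library_vectors - vec, axis=1)
    best = int(np.argmin(d))
    return library_labels[best], d[best]
```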

As such, 2-D pixel tracking is provided by a modified Optical Flow algorithm based on the Lucas-Kanade method of estimation [159]. However, robust m-estimation is also implemented, with iterative re-weighting of constraints to improve the estimation of the motion field [172]. Furthermore, this stage implements a standard affine-deformation and iterative image re-warping method, as in [172]. These algorithms were chosen for simplicity, and have not been significantly modified from their proposed implementations. The region of interest tracked for each model feature by the Optical Flow algorithm is initially the 64x64 pixel area found through global PCA-based search. However, during m-estimation, an image mask is constructed, consisting of values of zero or one: zero if the pixel constraint is later rejected as an outlier, one if it is not. This mask is bled using two passes of a 3x3 box filter. Thresholding is applied, and pixels with a mask value below the selected threshold are dropped from the tracked region. In this manner, the algorithm tends to permanently remove outliers from future frames.

Stage L4
This stage forms world-coordinate constraints on model-feature positions, given detected normalized camera coordinates. As with the previous three stages, it shares CPU hardware, due to the nature of task scheduling on the PC. Subsequent pipeline stages are implemented on the second, identical PC. As mentioned in Chapter 3, this process inherently depends on the motion model chosen. These experiments implement the combined system and motion calibration model discussed in Chapter 5; complete details of this method and the corresponding motion model are presented in Chapter 4. The coordinate translations can be abstracted to two equations, Equations (6.41) and (6.42):

x_i = R^(−1)·(x_c − t)   (6.41)
x_w = x_i + o_w − T_s   (6.42)

In Equation (6.41), intermediate de-rotated coordinates, x_i, are defined; they are determined from the (non-normalized) camera coordinates, x_c, with the translation, t, and stage rotation, R, reversed. In Equation (6.42), the intermediate coordinates are given the calibrated world-coordinate offset, o_w, and the calibrated effect of stage translation, T_s, is removed. For the rotary stages in the experimental setup, stage rotation is calibrated as a first-order motion model with no correlation; all rotational directions are assumed to be independent. For translational motion, the method calibrates a third-order motion model with correlation between all axes. All equations for these models are shown in detail in Chapter 4, and have been abstracted here for brevity. Constraints in world coordinates are formed and passed to the next stage.

Stage L5
The 3-D Solver Agent implements the optimization described in Chapters 2 and 3. The chosen method for the experimental implementation directly linearizes the constraints calculated by Stage L4, and implements a simple pseudo-inverse solution as a first step. From Chapter 3, the intersection for a given point can be found by solving the parametric constraint equations. When using these equations, constraint weights are necessary to control the leverage of a particular constraint, based on its expected contribution to the quality of the solution. These weights, w_i, are chosen based on robust m-estimation using a GM-estimator [158], Equation (6.43):

w_i = 1 / (1 + e_i^2)^2   (6.43)

The error value for each constraint, e_i, is calculated directly from the non-linearized constraint from Stage L4 through back-substitution. After weighting, the optimization continues through another step, and the error values and weights are re-calculated. In this manner, the implementation iteratively adjusts the weights for each constraint, reducing the effect of outliers.

Stage L6
The form-recovery process for this agent is implemented exactly as described in Chapter 3. As with the quasi-static experiments from Chapter 5, the method takes the center of the face, neck, chest, and shoulders as reference points. These points were found to be relatively invariant over the action libraries used in the experiments, and thus were suitable for this purpose. For the calculation of uncertainty in Chapter 3, the system initially uses weightings that assign roughly equal importance to each component of the uncertainty.

Stage L7
As mentioned in Chapter 3, the Prediction Agent uses a basic KF for motion prediction. Since all motion paths being sensed in the following experiments are linear, it is sufficient for the implementation to use a basic 2nd-order set of state variables. This implements a typical position-velocity-acceleration profile. Due to the high update rate (and, hence, the small time between updates), this 2nd-order approximation also proved sufficient for form estimation over a short time-span. Obviously, this approximation would be insufficient for predicting relatively far into the future.

Stage L8
The Central Planner Agent implements the Flexible Tolerance Method (FTM, [152]) to solve the core optimization proposed in Chapter 2. Since this is a constrained optimization, the limits of the sensors must be included. For each dof, with sensor displacement or rotation q, limits on the dof are defined using Equation (6.44).

max(l_f, l_a) ≤ q ≤ min(u_f, u_a)   (6.44)

For Equation (6.44), l_f and l_a are the feasible and achievable lower limits, respectively, and u_f and u_a are the corresponding upper limits. The FTM requires limits expressed in the form g(q) ≥ 0 or g(q) ≤ 0. The following equations, Equations (6.45) and (6.46), show each of the above limits after transformation into constraints:

min(u_f, u_a) − q ≥ 0   (6.45)
max(l_f, l_a) − q ≤ 0   (6.46)

For each static sensor dof, positioning is handled through additional constraints. For a camera fixed in the x direction at exactly x_s, the following constraint is used, Equation (6.47):

q_x − x_s = 0   (6.47)

Together, these constraints also form the initial search polyhedron. The flexible-polyhedron search engine [152] is used to perform the optimization outlined in Chapters 2 and 3. The reflection, expansion, and contraction parameters were empirically selected to be 1.86, 0.27, and 1.11, respectively. These values yielded the best average-case performance (in terms of optimization iterations and computation time) during the real-time experiments that follow.

The selection of the optimization objective is also critically important to the operation of the system. The only controllable parameters are still the sensor poses, which directly influence visibility of the subject (all other factors controlled). The selected visibility metric for an object sub-part was derived for past experiments in Chapter 5, and remains unchanged in this implementation. For Equation (5.2) in Chapter 5, which gives the visibility-metric calculation for an object sub-part, the experimental weighting factors are taken as constant (equal to one). The metric uses the total number of pixels being tracked for the sub-part and the number of these pixels visible at the current instant. These values are estimated by 3-D simulation of the camera view using the simulated environment originally proposed in Chapter 4. Since this environment maintains an accurate representation of the real-world environment, including detailed surface models of the OoI and all obstacles, it is ideal for this application. The area sub-metric can be quickly estimated by rendering a reduced-size projection to an off-screen frame-buffer.
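The Flexible Tolerance Method itself is not part of standard numerical libraries, so the sketch below substitutes a penalized Nelder-Mead flexible-polyhedron search to illustrate how the dof limits of Equations (6.44) to (6.47) and a visibility objective fit together; the visibility function, limit values, and penalty weight are placeholders, not the thesis's actual objective or solver.

```python
import numpy as np
from scipy.optimize import minimize

def make_bound_violation(lower_f, lower_a, upper_f, upper_a):
    """Eqs. (6.44)-(6.46): per-dof limits, returned as a total-violation function."""
    lo = np.maximum(lower_f, lower_a)
    hi = np.minimum(upper_f, upper_a)
    def violation(q):
        return np.sum(np.maximum(0.0, lo - q)) + np.sum(np.maximum(0.0, q - hi))
    return violation

def plan_poses(visibility, q0, violation, penalty=1e3):
    """Flexible-polyhedron-style search: maximize predicted visibility subject to limits.
    Nelder-Mead stands in here for the thesis's Flexible Tolerance Method."""
    objective = lambda q: -visibility(q) + penalty * violation(q)
    res = minimize(objective, q0, method="Nelder-Mead",
                   options={"xatol": 1e-3, "fatol": 1e-4, "maxiter": 500})
    return res.x

# Toy example: two dof, a quadratic visibility surrogate standing in for the rendered-pixel metric.
violation = make_bound_violation(np.array([0.0, 0.0]), np.array([0.1, 0.0]),
                                 np.array([1.0, 3.14]), np.array([0.9, 3.14]))
toy_visibility = lambda q: -np.sum((q - np.array([0.5, 1.0])) ** 2)
q_star = plan_poses(toy_visibility, q0=np.array([0.2, 0.3]), violation=violation)
```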

This method is both faster than an ad-hoc polygonal projection and more accurate. All other details of the core optimization and visibility metric remain unchanged from the methodology proposed in Chapter 5.

Stage L9
The Action-Recognition Agent operates exactly as described in Chapter 3. The action library itself was captured for the human analogue from three real human subjects. The results were fused and incorporated into the design of the human analogue using the method detailed in the previous section, Section (6.2).

Stage L10
The experimental implementation of the Referee Agent does not implement any global rules, as none were found to be necessary for the chosen experiments. Default or fallback poses are given by a 2nd-order KF track of each sensor parameter, with a window of 5 instants. In this manner, the fallback pose for a given instant is a prediction of the sensor's continued motion, given the past five sensor-pose decisions.

Overall, this experimental implementation is a typical example of an implementation that would result from applying the novel, customizable framework proposed in Chapter 3 and Section (6.1) to a given TVG object action-sensing task. In this case, the design process is somewhat constrained by the available hardware and by the necessity to compare results with past experiments, but the results can still be generalized.

6.3.2 Experimental Environment and Set-up
The experimental environment and setup used in the subsequent experiments is nearly identical to that used in the previous quasi-static experiments in Chapter 5. The real-world experimental environment, as described in Chapter 5, is designed to accurately reflect a real human action-sensing situation. The layout of the sensors, including their capabilities and initial poses, is intentionally identical to past simulated and quasi-static layouts. A top-down view of the sensing environment is shown below, in Figure 6.6.

FIGURE 6.6 TOP-DOWN VIEW OF THE REAL-TIME EXPERIMENTAL ENVIRONMENT AND SETUP

As in the past experiments, the sensing system is comprised of four Sensor Agents, with associated physical sensors. To avoid repetition, only a summary of the other pertinent details is presented herein. The workspace dimensions and layout correspond to the previous experiments, and the workspace contains multiple static or dynamic obstacles. All modeled obstacles are opaque, 100 mm diameter, cylindrical objects. The human analogue uses an X-Y linear stage for positioning, giving a planar area of motion. The analogue itself is implemented using the method in Section (6.2), using data captured from three human subjects. All cameras have a rotational dof with 360° of travel, restricted to 180°. Two cameras have a translational dof with 500 mm of travel. All system hardware is identical to previous trials. This includes the system calibration and cameras. Static-obstacle positions and dynamic-obstacle paths are specified to correspond to the previous experiments in Chapters 4 and 5. This environment will be used for Experiments 1 and 3 to follow. However, due to the restrictions on the environment scale, it is impossible to directly use the system to sense a real human.

To complete the trials required for Experiment 2, a modified experimental scheme is used. This will be described as part of the discussion on the setup of Experiment 2.

Experiment 1
The first set of experiments consists of 16 sub-experiments, each consisting of 100 trials of length 10 sec. The pose library for these experiments consists of a single action, walking; the outcome is simple success or failure in recognizing this action. A range of increasingly intrusive obstacles is tested: (i) no obstacles, (ii) two static obstacles, (iii) four static obstacles, and (iv) two dynamic obstacles. The positions and path space of all obstacles are shown in the overview in Figure 6.7.

FIGURE 6.7 OBSTACLE LOCATION OVERVIEW FOR EXPERIMENT 1

The dynamic obstacles closely flank the subject, one per side. For each obstacle set, the level of system reconfiguration capability is also varied between (i) static cameras (none), (ii) dynamic, velocity-limited (slow), and (iii) dynamic, full-speed (fast). For the velocity-limited trials, all stages are limited to 20% of maximum capability (90 for translational velocity and 0.56 for rotational velocity). These experiments are selected to correspond with the past simulated trials and the real-world, quasi-static trials from Chapters 4 and 5. As mentioned above, the walking motion was recorded from a real human subject, and is implemented by the human analogue described in Section (6.2). Two precision linear stages and one rotary stage are used to implement a quasi-randomized path for each trial. The path is checked a priori to ensure that it interacts with each static obstacle for at least 0.5 sec. The path is also checked to ensure that it remains within the workspace throughout the 10-sec trial. Subject velocity is varied between 50 mm/s and 100 mm/s, and the time-normalization factor of the walking motion is varied randomly by up to 10%. These changes over past trials are designed to enhance the realism of the subject motion, while maintaining close correlation to past results.

The goal of these experiments, as identified at the start of this section, is to verify that past conclusions about system reconfiguration ability and the effect of obstacles are still valid for this new system. If the conclusions are identical, this provides strong evidence that the new methodology improves sensing-task performance in the same manner as the previous systems, and that the core assumptions behind all of the proposed systems are valid. These experiments will also show that all of the previous sensing problems can be addressed by the generalized framework. It is important to note that this does not invalidate the novelty of the other two methodologies; it is beneficial for the system designer to use the simplest system possible which addresses the sensing problem at hand.

Experiment 2
The second experiment is designed to complement Experiment 1 by examining the performance of the system when sensing a real human subject. For this experiment, an action library consisting of five actions (walking, waving, pointing, kneeling, and kicking) was captured from three unique subjects. The experiment will consist of three trials, one for each subject, of 10 repetitions each. The experiment will be repeated twice, once with no obstacles and once with static obstacles. Pose libraries will be separate for each subject, to avoid issues with mixing identification and action recognition. For each repetition, three randomly selected actions are performed in sequence (but the sequence is identical between the two trials of the same subject). The physical setup and hardware remain the same as in past experiments; however, they are not ideally suited to sensing real humans, due to the approximate 1:6 scale. Thus, the layout and camera travel

limits were modified in this experiment to allow the system to sense the human subject. The modified system setup is shown below, in Figure 6.8. All capabilities and travel distances remain the same; only the geometry of the layout is modified.

FIGURE 6.8 MODIFIED SENSOR LAYOUT FOR REAL-HUMAN TRIALS

When compared to previous experiments, it is evident that the system will operate with reduced reconfiguration ability, as much of the available sensor motion can only occur on paths perpendicular to the subject. In addition, depth information would be relatively poor if only the original four cameras were to be used. To compensate for this, two static cameras, Static Cameras 1 and 2 in the

above figure, were added. These cameras are calibrated using the same complete system-calibration method used for all other cameras, but are not included as part of the reconfiguration problem.

Experiment 3
The final experiment shares all implementation details with Experiment 1. The experimental setup for these trials will be explained directly in Section (6.3.5).

6.3.3 Real-Time Multi-Action Sensing and Comparison to Past Results
The first experiment, Experiment 1, was designed to implement the same validation tests as previous experiments in Chapters 4 and 5. As in those experiments, the goal is to show a tangible increase in sensing-task performance when using active vision rather than static cameras. It is also expected that, for a given level of task difficulty (as determined in this case by the intrusiveness of the obstacles present in the environment), there will be a critical level of system reconfiguration ability necessary to maintain un-occluded views of the subject. As part of the experimental setup, the reconfiguration ability of the system in the slow-reconfiguration trial has been selected to be exactly at the minimum level necessary to perform all tasks. In past experiments, the selected level was slightly lower, and the system was outpaced when dynamic obstacles were introduced. Results for this experiment are presented in Figure 6.9.

FIGURE 6.9 RESULTS FOR EXPERIMENT 1 - TESTING THE REAL-TIME SYSTEM

From Figure 6.9, one can compare the baseline performance of this real-time implementation to past results. As expected, there is a tangible improvement in the average recovered form error. In particular, static cameras exhibit up to 42% error in recovered form when dynamic (highly occluding) obstacles are present in the environment. By contrast, with slow or fast reconfiguration, the error rate is relatively constant (around 5%, which is close to the noise floor for the vision and form-recovery methods used). For the other trials, static cameras exhibit approximately double the error rate, which is an intuitive result. Typically, due to the off-line layout of the system (which is designed for active vision), only two static cameras view the subject at all, compared to all four when the cameras are movable. An off-line layout for a static system would require significantly more cameras to maintain un-occluded views of the subject at all times.

For a more formal evaluation, one can examine the performance graph. Therein, a clear trend is also visible. At least a 90% success rate in recognizing the action is maintained for all trials (including those with dynamic obstacles), with greater than 99% success in all trials without dynamic obstacles. By contrast, static cameras exhibit a decreasing success rate, beginning at 41% and falling to 0% when dynamic obstacles are present. These results closely mirror the simulation results presented in Chapter 4 and the quasi-static experiments in Chapter 5. In all cases, the use of active vision removes the effect of the obstacles on system performance. It also improves average performance by selecting near-optimal views, and by continually sensing the subject with all available sensors. These results mirror the conclusions found through all previous experiments, providing strong evidence that the underlying assumptions behind all of the proposed systems are correct.

6.3.4 Real Human Action Sensing Experiments
While the results presented in Section (6.3.3) agree with past results for non-real-time systems, it is also useful to examine relatively un-controlled trials with real humans. These trials are designed to demonstrate that an implementation of the proposed framework is robust to all of the typical real-world factors associated with sensing a human in an uncontrolled environment. Note that these experiments are not repeatable, and thus not useful for generalization, but they at least provide an intuitive evaluation of the system. If these results agree with previous conclusions, they will provide additional evidence that the results from the more rigorous experimental environment are representative of real-world situations. Furthermore, human action sensing is an inherently real-time task; hence, it is desirable to examine system performance in this context. As mentioned in the experimental setup discussion, five actions (walking, waving, pointing, kneeling, and kicking) were captured from three unique subjects. Results are summarized in Table 6.1.

TABLE 6.1 RESULTS FOR EXPERIMENT 2, REAL-TIME SURVEILLANCE OF A REAL HUMAN
(For each of the three subjects, the table reports the estimated recovered form error (%) and the success rate (%), under both the no-obstacles and static-obstacles conditions.)

Trials using static cameras were omitted, as no single repetition was successful in any of the practice runs. From Table 6.1, one can conclude that the system is functioning as intended, in that a high success rate (at least 80%) is maintained for all trials. The addition of static obstacles does not cause any appreciable decrease in performance for any of the subjects. Overall percentage-error values and success rates are similar between subjects. The key difference from the previous experiment is the lower overall performance. This is mainly caused by poor depth information due to the off-line layout of the system, and by the difference in design scale (the system is designed to sense a 1:6 scale analogue). In essence, this experiment ignores a key part of the proposed framework: off-line sensing-system reconfiguration to determine starting positions, capabilities, placement, etc. Despite this, a tangible improvement in performance is still achieved. Furthermore, when compared to the past experiments in Chapter 5 on human gait-based recognition, a close correlation is found. Sensing-task performance is relatively unaffected by the addition of obstacles, as expected. Performance tends to plateau near 80% in all trials, which is mainly a result of the action library and matching criterion (Euclidean distance) used. Thus, these results support past conclusions about the performance of the system when sensing real humans. Specifically, the system is capable of sensing the action of a TVG object (a human) in real-time, while providing a tangible increase in task performance over static cameras, even in the presence of obstacles. Given that the system and its basic assumptions are verified, Experiment 3 will characterize this system's performance and its response to various design parameters.

6.3.5 Real-Time Performance Characterization
Given the baseline data established in Experiments 1 and 2, it is necessary to characterize the real-time performance of the resultant system; this is the key focus of this chapter. The goal of this experiment is to determine the relation between key problem-dimensionality parameters (number of sensor dof, number of obstacles, and action library size) and typical real-time performance metrics.

As such, for each value of each dimensionality parameter, a 20-repetition experiment is performed, as per the experimental setup outline.

Sub-Experiment 1
For the first experiment, the goal is to characterize the effect of increasing the number of degrees of freedom the system must configure. In the physical system used for the experimental implementation, there is a maximum of 6 dof (4 rotational, 2 translational). For increasing dof, the system adds virtual cameras sequentially using an off-line configuration algorithm. Each virtual camera starts with a rotational dof, and gains a translational dof in sequence. Since these virtual cameras contribute only simulated information, rather than real-world information, to the reconfiguration process, success-rate statistics are purely estimated when more than 6 dof are used. For cases with fewer than 6 dof, it is assumed that the remaining cameras are present, but static. Results for this experiment are shown in Table 6.2 and are graphed in Figure 6.10.

TABLE 6.2 SYSTEM METRICS FOR INCREASING DOF IN EXPERIMENT 3
(For each tested number of degrees of freedom and cameras, the table reports the maximum update rate (Hz), minimum update interval (ms), actual success rate (%), stage pre-emptions, pose latency (ms), fallback poses, and average horizon optimization depth.)

FIGURE 6.10 EFFECT OF THE NUMBER OF DOF ON MINIMUM UPDATE INTERVAL

It is known from Section (6.1) that if the system is allowed to operate at the ideal design update rate, the success rate for these trials should approach 95%. Increasing the update rate beyond this will cause pre-emption of the slowest pipeline stages, and possibly the use of fallback poses or even task failure. All experiments that follow define the maximum possible update rate as the update rate at which the system achieves at least a 95% success rate. This is found through trial and error by setting the system clock and repeating the trials. If a 95% success rate cannot be achieved, the maximum update rate is instead the update rate resulting in the highest observed success rate.

Figure 6.10 shows a line of best fit which demonstrates that, for the final implementation, the minimum update interval grows approximately log-linearly with the number of dof over the range of values tested. The minimum update interval is generally determined by the maximum time needed by any pipeline stage, plus overhead. At low numbers of dof, the reconfiguration ability of the system is too low; results are similar to those for static cameras, and it is difficult to even achieve 95% success. For higher numbers of dof, the effect of this parameter dominates the update interval. However, in real-world designs, the selected number of dof should be as low as possible while still achieving the desired level of performance. With very large numbers of dof, performance loss is inevitable: increasing the update interval prevents poor pose choices, but pose latency also increases, so the system cannot keep pace with subject actions.
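The trial-and-error determination of the maximum update rate described above could be automated along the following lines; run_trials is a hypothetical hook that re-runs the repetition set at a given system clock and reports the observed success rate, and the 95% target mirrors the criterion stated in the text.

```python
def find_max_update_rate(run_trials, rates_hz, target=0.95):
    """Return the highest update rate meeting the 95% success criterion.

    run_trials(rate_hz) is assumed to repeat the experiment at the given system
    clock and return the observed success rate (a placeholder hook). If no rate
    reaches the target, the rate with the highest observed success is returned.
    """
    best_rate, best_fallback, best_success = None, None, -1.0
    for rate in sorted(rates_hz):
        success = run_trials(rate)
        if success >= target:
            best_rate = rate                   # meets the criterion; keep pushing faster
        if success > best_success:
            best_fallback, best_success = rate, success
    return best_rate if best_rate is not None else best_fallback
```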

Selecting a faster update interval causes the system to select poor poses, or to pre-empt the decision process and use fallback poses. Other performance parameters are summarized in Table 6.2. One can see that the actual success rate is at least 95% for trials with sufficiently high dof and a sufficiently low update interval. For trials with a high success rate, the number of stage pre-emptions is low, and fallback poses are rarely used. Also, given the fixed time horizon and the goal of achieving 95% success by increasing the update interval, the system has ample time to search the entire optimization window (i.e., the depth approaches the window size) with larger update intervals. These trends are identical for the remaining two experiments, so the presented results will focus only on the minimum update interval.

Sub-Experiment 2
The second sub-experiment examines the effect of the number of obstacles present in the environment. The number of obstacles is varied and, beyond the four physical obstacles, virtual obstacles are used. Virtual obstacles are applied through injection of an artificial object into the video stream. All dof available in the physical system are used. Results from this trial are shown in Figure 6.11.

FIGURE 6.11 MINIMUM UPDATE INTERVAL AND THE EFFECT OF OBSTACLES

As one can see from Figure 6.11 and the associated fit line, the effect of the number of obstacles on the update interval is comparatively small over the range tested. This effect is expected, as the addition of extra obstacles primarily increases the optimization and estimation time only. Optimization is the more costly operation, and dominates the update-interval time. However, for lower numbers of obstacles, the system can compensate by reducing the depth of the optimization horizon without selecting significantly worse poses. Only at extremely high numbers of objects is the effect noticeable. For these cases, other steps in the pipeline, such as the visibility-metric calculation, become the bottleneck. For average numbers of objects, however, real-time performance can be modeled as being relatively insensitive to the number of obstacles and the resultant occlusions.

Sub-Experiment 3
The final sub-experiment examines the effect of pose-library size on the update rate. For this test, arbitrary, randomly generated actions are added to the pose library. These actions are identical in size to the walking motion, but are purposely generated to be as dissimilar to the walking motion as possible. Results are shown for increasing numbers of actions in Figure 6.12 below.

FIGURE 6.12 MINIMUM UPDATE INTERVAL AND LIBRARY-SIZE EFFECT
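The library lookup whose cost is discussed below can be illustrated with the brute-force Euclidean search implied by the text, alongside one possible "more efficient search method" (an indexed nearest-neighbour structure); the data shapes, names, and the k-d tree choice are assumptions for illustration only, not the thesis implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def linear_library_search(form_sequence, library):
    """Brute-force action recognition: Euclidean distance to every library entry."""
    dists = [np.linalg.norm(form_sequence - entry) for entry in library]
    return int(np.argmin(dists))

class IndexedLibrary:
    """A k-d tree index, one option for a more efficient library search."""
    def __init__(self, library):
        self.flat = np.array([entry.ravel() for entry in library])
        self.tree = cKDTree(self.flat)
    def query(self, form_sequence):
        _, idx = self.tree.query(form_sequence.ravel())
        return int(idx)

# Example: 200 synthetic 'actions', each a 30-instant sequence of a 20-value form vector.
library = [np.random.rand(30, 20) for _ in range(200)]
probe = library[42] + 0.01 * np.random.rand(30, 20)
assert linear_library_search(probe, library) == IndexedLibrary(library).query(probe)
```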

As expected, the library size has an approximately linear effect on the update interval. The library is only accessed to perform action recognition and pose prediction. For action recognition, a search of the library occurs (hence, the dependence of the search time on library size). Pose prediction uses constant-time operations only. If a very large pose library were used, a more efficient search method could reduce this time further.

Overall, the number of cameras has the most direct effect on the maximum achievable update rate. Implementation improvements should, thus, be focused on Stages L1 to L5 to achieve the best results. This matches the design assumptions discussed in Chapter 3. In general, the pipeline is front-loaded in complexity, as these stages contain many per-pixel operations, as well as the computationally intensive vision-based feature detection and tracking. As part of future work, novel research is proposed which would address this issue by creating vision algorithms designed specifically for use in real-time, active-vision systems.

6.4 Summary
This chapter discussed further details of the novel, customizable TVG object action-recognition framework which was initially detailed in Chapter 3. This method incorporates all the changes and improvements to the baseline methodology discussed in Chapters 4 and 5. It is also designed specifically to operate in real-time in a variety of action-sensing applications. The evaluation of this system, and of other active-vision systems, was also examined in this chapter. It was determined that a formal set of evaluation criteria is necessary to accurately describe the performance of such systems. Thus, a formal set of criteria which uses direct and indirect measures of sensing-task performance to characterize the usefulness of an active-vision system was developed. A detailed evaluation environment which builds on past experimental environments was also detailed. As part of this environment, a method for creating controllable TVG object analogues was proposed. These object analogues allow one to control the most important environment variable: the input data.

Experiments were also presented to verify the performance of the proposed real-time method. Initial experiments recreated past trials from Chapters 4 and 5 to ensure that the basic assumptions remain valid, and that all observed performance effects are present in the real-time system. Experimental results demonstrated the same tangible increase in sensing-task performance when using active vision instead of static cameras that was seen in all previous experiments. Further trials using real humans confirmed this result, and demonstrated that an implementation of the proposed framework is capable of sensing a real-world TVG object's actions in real-time. The final experiments in this chapter characterized this performance further by examining the effect of the number of dof, the number of obstacles, and the action-library size on the maximum achievable update rate. The results from this

experiment are useful to system designers, as they allow one to better predict the performance of the proposed methodology in a given sensing task. The final chapter of this dissertation, Chapter 7, will summarize the results presented in the past chapters and the corresponding experiments. Multiple areas for future research have been identified as part of this work, so it is useful to examine them to better understand the limits of the proposed methodology.

224 7. Conclusions and Future Work This dissertation has presented a customizable, real-time, real-world framework based on a 10-stage pipeline architecture, which is designed specifically for recognizing the actions of time-varying geometry (TVG) objects. The proposed framework builds on previous single-action, multi-action, and multi-level action methodologies. These methods also remain useful for simplified sensing tasks, where the flexibility and generality of the final methodology is not necessary. All of the proposed methods are based on a novel method of on-line sensing-system reconfiguration, designed specifically with the attributes of TVG objects in mind. The basic assumptions of this method were tested through rigorous simulated and real-world experiments in Chapters 4 to 6. Real-world implementations of the multi-action and real-time methodologies were also characterized in terms of sensing-task performance and sensitivity to environmental and design factors. The final chapter of this dissertation will serve to summarize these results and the details of the completed system, Section (7.1). It will also examine, in Section (7.2), potential areas for future research, as a continuation of this work. 7.1 Summary and Conclusions The principal objective of this work was the development of a complete, general, and customizable framework which is specifically designed to sense TVG objects and their actions in real-world, realtime environments. It was identified in Chapter 1 that there are many key performance determinants which are common to any sensing problem. Obstacles may occlude the view of the object of interest (OoI), and the obstacles and the OoI itself may be moving or maneuvering on an a priori unknown path. Many other environmental variations are also possible, such as changes in lighting levels, background clutter, and so on. However, it was also determined that TVG objects have properties which raise unique issues for any system sensing their actions. In particular, these objects may selfocclude, and have other properties which greatly increase the difference in importance between varying viewpoints at the same instant. Furthermore, an action was identified to be a continuous parameter, which must constantly be sensed by the system. This means that viewpoints are also differentiated over time, as a viewpoint s importance will change based on the OoI s current action. Past computer-vision methods which have been applied to the TVG action sensing problem were examined. They are typically classified as template matching, semantic, or statistical methods. Template, or silhouette matching techniques, perform direct image-based matching on either sensor images or extracted OoI silhouettes. It was found that these methods are prone to noise (in the extracted silhouette or region of interest), and have difficulty sensing a wide variety of relative OoI poses and action sequences. Semantic methods attempted to address some of these issues by 199

225 200 extracting pose-invariant features, compiling a high-level model of the OoI, and performing action recognition on a model feature vector. Statistical methods were also proposed which extend both of these methods by using statistical dimensionality reduction techniques to decrease the computational effort required to perform action recognition. The above methods tend to avoid directly addressing environmental variations, such as obstacles, and instead are designed to be robust to error. Furthermore, all sensor input data is assumed to be fixed; while some works propose ad-hoc methods similar to off-line calibration, and others mention the benefits of active vision, the majority of these works use static cameras. To address the above issues, it was proposed that sensing-system reconfiguration could be used in place of static cameras to improve the input data first. Active-vision systems directly address sensing issues which would otherwise be addressed through highly robust (and highly complex) systems. They may also improve sensing task performance in situations where robust, static camera systems cannot. Past reconfiguration methodologies were reviewed, and were broadly categorized as off-line or on-line, and by the nature of the objects and environments they sense (static environment, dynamic environment). However, it was also identified that the majority of these works inherently assume static-geometry objects. No methods were found which directly address the issues inherent in sensing TVG objects and actions. When sensing multiple objects, many methodologies were found to use attention-based behavior, which is inapplicable to continuous sensing. As a result, a clear need was identified for a TVG object action sensing framework, which applies a novel active-vision solution, designed specifically to perform on-line system reconfiguration for TVG objects Customizable Framework Building on the above requirement for a TVG object action-sensing, active-vision system, Chapter 2 identified and quantified the constraints for such a system. It was identified that the core task for this active-vision system is to maximize sensing-task performance through manipulation of on-line sensor poses. Thus, the core optimization problem was introduced, which is solved by the system to achieve online pose selection. The critical sensing tasks for TVG object action recognition were identified as detection, tracking, estimation, and action recognition. Attributes specific to TVG objects, such as non-uniform importance of viewpoints, self occlusion, and continuous sensing, were identified as issues which the reconfigurable sensing system could also address. After formally outlining all requirements of such a system, a detailed, mathematical formulation was created. This formulation uses a metric of visibility to allow the system to directly perform constrained optimization, taking sensor poses as variables. This avoids using sensing-task performance directly in the optimization, as it is difficult to write a closed-form relation between sensor poses and

226 201 performance, and because performance measurements are typically only available after poses should have been selected by the system. Thus, visibility was defined as a closed-form function of sensor pose, while maintaining the existence of a monotonically increasing performance function. The visibility metric itself was developed as a summation of application-customizable sub-metrics, such as visible surface area, distance, and relative angle. The visibility metric was also formulated into an articulated metric which allows the system to assign weights to sub-parts of the object based on their relative importance. To implement this core optimization problem, a novel, customizable framework, based on the pipeline architecture, was proposed in Chapter 3. The proposed framework uses a ten-stage pipeline to achieve real-time, real-world operation through higher average update rates than an equivalent, non-pipelined system. The pipeline begins by asynchronously capturing images from system sensors in Stage L1, the Imaging Agent. These images are filtered, corrected, and adjusted to improve their quality and remove distortion. Non-useful sections of the images are removed through interest filtering, and the result is passed to Stage L2. This stage synchronizes the images, using an image interpolation method, to a single world time. The next stage, Stage L3, searches these images for points of interest, which it detects and tracks in un-distorted pixel coordinates. Processing continues in Stage L4, which reverses the projection model of the camera and system, transforming all detected feature point pixel coordinates into normalized camera coordinates. Detected features from all sensors are used to form complex constraints in world coordinates. These constraints are used in Stage L5 as part of a calibration-model-based method to solve for points of intersection, representing world coordinates of detected features. Stage L6 uniquely identifies feature points, and iteratively fits them into an object form feature vector using a method which combines a priori information, model constraints, and the current OoI action. Stage L7, the Prediction Agent, forms predictions of future object poses, OoI poses, and OoI forms using a predictive filter and knowledge of the current OoI action. The core system visibility optimization is implemented next by the Central Planning Agent, Stage L8, which uses all previous information to perform constrained optimization, resulting in a sensor pose decision. The vision payload is implemented in Stage L9, which recognizes actions from the stream of OoI forms estimated by Stage L7. Finally, Stage L10 enforces global rules and maintains fallback poses for the system, while also implementing motion control. As mentioned in the summary for Chapter 3, this proposed method provides a comprehensive implementation strategy which can be adapted to a wide variety of articulated TVG object action sensing tasks. In particular, single-action sensing, multi-action sensing, and multi-level action sensing were investigated in Chapters subsequent to Chapter 3.

227 Single-Action Sensing The first application considered in this work is also the simplest form of TVG object action recognition, wherein a single action is to be sensed. By examining this application, it can be shown that the use of active vision for TVG object action sensing can tangibly improve sensing-task performance over traditional, static camera systems. Furthermore, the methodology developed for single-action sensing is the precursor to the customizable real-time framework found in Chapter 3. As mentioned in Chapter 4, due to its simplicity it is also useful as a simplified methodology for use in low-cost, single-action applications. The proposed single-action methodology is based on a centralplanning-based architecture, and is comprised of a central planner agent, sensor agents, a referee agent, form and action recognition agents, and a prediction agent. Together, these agents communicate to select system sensor poses through their aggregate behavior. The specifications of the individual agents form the behavioral basis for the later pipeline methodology. A simulation environment was developed to test the proposed system. The simulator is designed to emulate all of the elements of a real-world system, and can generate accurate images from any desired sensor in the physical system. The environment itself simulates the basic factors affecting sensing task performance, while allowing individual variables to be controlled. Accurate object models and a detailed OoI model were also included in the simulation environment. The goal of the simulated environment is to provide an experimental test-bed where the basic assumptions behind the proposed methodology can be verified, thus confirming an increase in sensing-task performance over static cameras. Two main experiments were presented in Chapter 4 using this simulated environment. The first experiment examined a human walking motion taken over a straight-line path, with fixed duration. The goal was to determine if the system could successfully recognize a single action, and to characterize the increase in performance over static cameras under differing levels of system reconfiguration ability. The experiment successfully verified that both tested forms of system reconfiguration, velocity-limited and ideal reconfiguration, improve sensing-task performance as measured by the recovered form error. Furthermore, it was concluded that the reconfiguration ability of the system has a direct impact on the gain in performance over static cameras. The exact effect of reconfiguration ability was characterized in later experiments. The simulated experiments also demonstrated that some instants require feedback from the action recognition process to allow the system to select the best possible poses. The second experiment characterized the effect of real-world pose prediction on the reconfiguration process. The goal was to demonstrate that the proposed methodology is robust to the uncertainty inherent in the prediction process. The previous experiment, wherein the system senses a

228 203 walking human, was repeated in the presence of varying types of obstacles. These obstacles were either static or dynamic, and all objects in the system were tracked using vision-based tracking and a predictive filter, resulting in inherent tracking error. It was concluded that the system is robust to tracking error below a critical level, after which the system became unable to consistently recover basic data necessary to make any pose decisions. As such, it was also identified that a method to allow the system to address instances where pose prediction was erroneous, or the tracking system failed entirely, was necessary. Overall, the system was also able to remove the effect of static and dynamic obstacles on the sensing task, confirming previous assumptions, and demonstrating a tangible performance increase over static camera systems Multi-Action and Multi-Level Action Sensing The next logical application of the method developed in this work is multi-action recognition and multi-level action recognition. Chapter 5 presented a novel multi-action and multi-level sensingsystem reconfiguration method designed specifically for sensing TVG objects and their actions. This method was developed through iterative improvement of the single-action framework presented in Chapter 4. It is also the immediate precursor to the complete, general, real-time framework presented in Chapters 3 and 6. As for the previous single-action methodology, this simplified methodology remains useful for low-cost, multi-action applications. As concluded in Chapter 5, the proposed methodology is well-suited to monolithic implementation, and may also offer superior pose decision latency in some applications. The proposed multi-action, multi-level methodology implemented changes to the baseline central-planning-based methodology which were identified in previous simulated experiments. In particular, the visibility metric and framework was modified to examine the individual OoI sub-part visibilities. As concluded from previous experiments, it is necessary for the system to incorporate the relative importance of different sub-parts of the OoI model directly into the optimization process, as their importance changes based on the current action and OoI form. Real-world issues were also addressed, as the principal goal of Chapter 5 was to present real-world experiments which verified the simulated conclusions found in Chapter 4. A real-world method of feature point detection and tracking was implemented, which was based on a modified Principal Component Analysis (PCA) and Optical Flow (OF) system. A real-world action library was also captured from actual human subjects, which was extended from the previous methodology to allow combinatorial multiple actions and simple multiple action coding. A manually-movable real-world human analogue was designed to allow the experiments to control the key system input variable, subject form.

229 204 Real-world experiments were also presented as part of this application. The goal of these experiments was to demonstrate a tangible improvement in sensing-task performance under a variety of multi-action scenarios, and to verify past simulated results. Again, two principal sets of experiments were performed. The first set of experiments characterized single-action recognition performance for varying levels of system reconfiguration (static cameras, slow reconfiguration, and fast reconfiguration) and varying levels of intrusive obstacles (no obstacles, static obstacles, and dynamic obstacles). It was concluded that if sufficient system reconfiguration ability is available, the system can present unoccluded views to the vision system, effectively remove the effect of obstacles on the system s performance. This mirrors the trend noted in previous simulated experiments, and further characterizes the effect by demonstrating that a critical level of reconfiguration ability exists for any given sensing task. Furthermore, it was shown that system reconfiguration is able to remove the effect of all tested obstacles, static or dynamic. Thus, the same tangible improvement in sensing task performance that was first noted in simulated experiments is also present in these real-world experiments. The second set of experiments examined these same scenarios when multiple actions and multiple simultaneous actions were to be sensed. To examine multi-action sensing, human gait recognition for multiple human subjects was implemented as a sample application. The goal of the experiment was for the system to distinguish human subjects from a library of multiple subjects based purely on gait. The results from this trial were consistent with the previous experiment, as the system tangibly improved sensing performance over static cameras by removing the effect of sensing issues, such as obstacles, on the system. A second trial was also performed to demonstrate that multiple, simultaneous actions at differing levels of detail can also be sensed. A detailed human action, pointing, was sensed simultaneous with a secondary, full-body motion. Results from this trial mirrored all previous conclusions about the sensing system. However, it was also identified that for true real-time operation in all of the above applications, a system which can enforce absolute deadlines on all agents is necessary. It was concluded that the final design iteration must be designed specifically with real-time sensing in mind to achieve the best possible real-world performance Real-time Sensing The final application of the proposed methodology discussed in this dissertation is real-world, realtime TVG object action sensing. This problem expands on the previous multi-action, multi-level sensing problem by requiring real-time operation of the pose selection and sensing process. Chapter 6 discussed details of the iterative redesign of the previous CPA-based architecture into the novel,
customizable TVG object action-recognition framework detailed in Chapter 3. This method is designed specifically to operate in real-time in a variety of action-sensing applications.
A detailed evaluation of this system, including a comparison to the previous active-vision methodologies, was also presented. It was concluded that a formal set of evaluation criteria is necessary to accurately describe the performance of this active-vision system, and of any subsequent implementation thereof. The use of direct and indirect measures of sensing-task performance to characterize the usefulness of an active-vision system was examined, and a complete set of performance criteria was developed. A detailed evaluation environment which built on the past real-world quasi-static and simulated experiments was also detailed. A novel method of creating TVG object analogues, designed specifically to ensure realism and accuracy of the evaluation results, was also presented. This method was deemed necessary, as the input data was concluded (from past experiments) to be of critical importance in determining the overall actions of the system. In order to present scientific results which can be appropriately generalized, it is necessary to perform repeatable experiments by controlling the OoI form across trials.
Experiments were, thus, presented to verify the performance of the proposed real-time method and to characterize its relation to system design parameters, including the number of actions in the action library, sensor degrees of freedom (dof), and the number of objects in the sensing environment. The first real-time experiments recreated past trials from Chapters 4 and 5 to ensure that all previous conclusions about sensing-task performance remained valid. Experimental results demonstrated the same tangible increase in performance seen in all previous experiments when using active vision instead of static cameras. Further trials used real humans to confirm this result in a qualitative manner, thereby demonstrating that an implementation of the proposed framework is capable of sensing a real-world TVG object's actions in real-time and in a relatively uncontrolled environment.
Lastly, experiments characterized the real-time performance of the proposed methodology. One trial was examined in detail to provide insight into the relation between the selected update rate of the system and other operating parameters, such as the use of fallback poses, the decision ratio, and sensing-task performance. It was concluded that the selection of the system update rate is a critical design parameter, which is also related to other aspects of the sensing task. This relation was characterized (using proportional computational complexity, in big-O notation) for the system design parameters mentioned in the previous paragraph, allowing future system designers to roughly predict real-time performance for their given sensing task. Overall, it was concluded that the implementation, and by extension the proposed framework, was successful in tangibly improving real-time sensing-task performance in a real-world environment.
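To illustrate how an update-rate deadline and fallback poses might interact in practice, the following is a minimal, hypothetical Python sketch of an anytime pose-selection step: candidate poses are evaluated until the per-instant time budget expires, and a precomputed fallback pose is returned if no candidate could be evaluated in time. The candidate generator, visibility metric, and budget value are illustrative assumptions, not the exact agents used in this work.

```python
import time

def select_pose(candidates, visibility_metric, fallback_pose, budget_s=0.05):
    """Anytime pose selection under a hard per-instant deadline.

    candidates        -- iterable of candidate camera poses (assumed given)
    visibility_metric -- callable scoring a pose, higher is better (assumed given)
    fallback_pose     -- precomputed safe pose used when the deadline is missed
    budget_s          -- per-demand-instant time budget, derived from the
                         selected system update rate
    """
    deadline = time.monotonic() + budget_s
    best_pose, best_score = None, float("-inf")
    for pose in candidates:
        if time.monotonic() >= deadline:
            break                          # enforce the absolute deadline
        score = visibility_metric(pose)
        if score > best_score:
            best_pose, best_score = pose, score
    if best_pose is None:                  # deadline hit before any evaluation
        return fallback_pose, False
    return best_pose, True
```

In such a scheme, the fraction of demand instants for which the second return value is True plays a role analogous to the decision ratio discussed above: lowering the update rate (a larger budget) raises it, at the cost of staler pose decisions.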

Future Work
The customizable framework presented in this dissertation is a complete and generalized solution which can be applied to a variety of TVG action-sensing tasks. However, as with any general method, there remain specific applications and problem areas for future research. In particular, through the iterative design process and experiments presented in this work, three key areas for future research were identified.

Hand Gesture Recognition and Improved Multi-Level Recognition
During the initial design phase, a set of sample applications for the proposed methodology was developed. These applications included many human-centric action-recognition tasks, including gait and body-motion recognition, hand-gesture recognition, and facial-expression recognition. In this dissertation, extensive experiments are presented on the topic of body-motion and gait recognition for humans (the primary subject in all experiments is a human). Hand-gesture recognition, however, was only briefly examined. The reason is that multi-level recognition was found to be inherently limited by the homogeneous sensors which comprise the experimental platform. Multi-level action sensing is a differentiated sensing task, in that the system must sense two different levels of detail simultaneously. Pure hand-gesture recognition, while smaller in scale than human body motion, would be similar to the experiments already presented in this dissertation. An off-line reconfiguration step would likely narrow the field-of-view of the cameras, but overall the results would tend to be very similar. However, by posing this problem as a multi-level action-sensing problem, the issues become more vexing. It was found that if the difference in the level of detail between two simultaneous action-sensing tasks is too great, there will be a significant limit on the ability of the proposed system to improve sensing-task performance. Although the system designer does have the ability to tailor the system to the task, the above method inherently assumes a relatively homogeneous sensing system and, by extension, relatively homogeneous levels of detail across all simultaneous actions.
As such, future experimentation could extend the proposed method to better address systems with heterogeneous sensing capabilities. These systems, which offer differentiated sensing ability between sensors, are necessary to truly address multi-level actions which occur at significantly different scales. The problem would be the assignment of system resources, which are now differentiated, based on the information available at a single demand instant. The goal would be to use, at each demand instant, the subset of the available sensors best suited to sensing each particular action. Obviously, this is a circular problem; one must know the current action to select the
sensors best suited to recognizing this same action. The system must also de-rate the sensing capability of sensors when they are used to sense actions they are not well-suited to. The above necessitates a secondary assignment step beyond the current optimization method. All cameras must be considered in this assignment, so it cannot be performed directly as part of the current per-camera optimization. One potential solution is to significantly expand the scope and capability of the Referee Agent, allowing it to select sensor-pose decision sets which best suit the expected actions of the OoI. However, the Referee Agent is located late in the pipeline, limiting its ability in this task (as seen in cases where the current system must select appropriate fallback poses). An ideal solution may require a feedback method which interacts with the Central Planning Agent over multiple instants. An additional pipeline stage may even be necessary to allow this assignment to be completed. These design issues must be considered before a complete, general solution applicable to many multi-level sensing tasks can be determined.

Facial Expression Recognition and other Deformable Object Actions
As part of the initial design process, facial expression recognition was identified as another potential application for the proposed active-vision system. This sensing task is typical of the problem sub-set defined in Chapter 2 as deformable object recognition. Facial expressions (actions) are distributed across the surface area of the face, and the nature of occlusions and viewing-angle trade-offs is inherently different from other action-sensing applications, such as gait recognition. In general, this class of TVG action-sensing problems requires further research beyond what is presented in this dissertation. Feasibility studies have identified that the proposed methodology can indeed successfully sense deformable object actions. No parts of the method inherently prevent or restrict the types of TVG objects which can be sensed, and Chapter 2 details how one might model a deformable object using a polygonal mesh (allowing it to be sensed directly by the proposed implementation). However, objects of this type are not well-suited to this sensing system, which has been designed specifically to sense articulated objects that can be modeled accurately using the joint-limb structure (also found in Chapter 2). Thus, this representation and methodology will likely have limited potential to improve sensing-task performance in this application.
To address these issues, some changes to the proposed method would be needed. Considering that sensing these objects constitutes a separate problem class, a framework designed from scratch with deformable actions in mind would likely yield improved performance. However, this would discard the data collected from the current methodology. Instead, an iterative re-design would be more suitable. The pipeline architecture would need modifications to the early stages of the pipeline that
are responsible for image capture and improvement, and for feature-point detection, tracking, and fitting. Since the underlying model representation would change, most of these stages would need to be completely re-designed. The overall real-time pipeline structure would likely remain the same, although the ordering and number of stages would also change to best suit the problem at hand. The above changes are inherently general, as it is difficult to predict the final set of changes without a feasibility study and careful design work, as was performed for the articulated sensing problem in this work. In general, deformable-object sensing tasks would yield significantly different system configurations during off-line reconfiguration. The range of motion of the cameras relative to the object motion would tend to be reduced in some applications (consider a system to sense muscle deformations during surgery). However, other applications would require a heterogeneous system similar to that proposed in Section 7.2.1 above. For example, facial expression recognition would still require wide-angle cameras to sense general human motion and 6-dof pose. The basic representation of the deformable model must also be re-considered. A polygonal mesh may be efficient in some applications, but it may not be the most general, and there is an inherent trade-off in the level of detail which can be measured by a system based on this representation. Finally, one must also consider how to recover, track, and fit a 3-D mesh to a given set of 2-D pixel locations in multiple images, which is an open computer-vision problem. Simply put, this is a significantly different problem which, while inherently similar to the problem at hand in this work, demands a novel solution outside the scope of this dissertation.

True Multi-Subject Action Sensing
As identified in the literature review, a natural extension to the sensing problem considered in this thesis is the simultaneous recognition of actions performed by multiple dynamic OoIs. Past work in static-geometry object sensing has yielded methodologies capable of sensing and recognizing multiple static-geometry objects. A typical example is [90], which uses an agent-based methodology and sensor fusion to automatically recognize human targets. These methods have also been designed to address many of the same real-world sensing issues identified in this dissertation. For example, methods have been proposed which address multiple maneuvering obstacles (e.g., [83]) and limited, non-uniform importance of viewpoints (e.g., [126]). However, past methods are inherently limited to sensing static properties of the objects. Many methods use sense-and-forget algorithms, where all attention of the system is focused on a single object until the sensing task is complete. As identified in Chapters 1 and 2, such methods cannot be applied to TVG object action-sensing problems, as the properties being sensed are time-varying, and thus require continuous attention. Other methods propose multiple sensing systems (e.g., [173]) or differentiated hardware. A typical example of the
latter method is [90], wherein a single global camera is used to sense 3-dof object poses for all objects in the environment. An object's pose is a time-varying parameter, although it is inherently much simpler to sense than the complete geometry or action of a TVG object. All of the above methods tend to be impractical for complete TVG object action sensing, for a variety of reasons.
The implication of the above is that a complete, general method for sensing multiple TVG objects and actions must determine the division of system resources on-line, in response to the changing sensing environment. This introduces a high-level resource-assignment problem, in addition to the basic reconfiguration problem. Past sense-and-forget methods offer an all-or-nothing solution to this distribution problem, which is valid for static-geometry objects, as no information is lost by ignoring an object for a time (as long as all objects are sensed before they leave the workspace). For TVG objects and actions, the system must determine the subset and amount of sensing-system resources to assign to each OoI in the environment. This task must be incorporated into the reconfiguration problem presented in Chapter 2. At this time, there are two potential methods to do so.
One potential method is to continue to view the sensing-system reconfiguration problem as a monolithic problem; the sole task of the system is to select sensor poses for the next demand instant. To continue with this definition, the core optimization must be modified to consider the amount of useful information recovered about all OoIs in the environment. Currently, the core optimization is relatively independent between cameras. Automatic weighting is used to steer cameras towards poses which are most useful to the overall system, rather than locally optimal. The Referee Agent can also act to enforce conditions on the overall set of sensor poses selected by the system. However, the optimization is still performed on a per-sensor basis. While beneficial for parallelism, this makes it difficult to incorporate multiple OoIs into the optimization. In these cases, the system must consider the complete set of information recovered by all sensors. In essence, all sensor poses must be decided in one simultaneous optimization, rather than on a per-camera basis. This has the potential to increase the computational complexity of the optimization process substantially, as the joint search space grows combinatorially with the number of cameras. It also significantly reduces the parallelism of one of the most computationally demanding tasks in the pipeline.
Another potential method is a two-level optimization; a minimal illustrative sketch of such an assignment step appears at the end of this sub-section. Similar methods have been proposed for other problems, such as multi-target, multi-pursuer maneuvering-target interception [174]. The goal of such methods is to use a global-level optimization to determine an assignment of system sensing resources to individual OoIs first. This assignment will take the form of a set of constraints and weights which are then applied to the current, per-camera sensing optimization. In effect, this is an extension of the current function of the Referee Agent. The computational complexity of the overall method is thus improved, as the two optimizations are sequential, rather than simultaneous. For
example, if both optimizations are of complexity O(n), the combined sequential method is still only 2·O(n), i.e., O(n); a similar monolithic method might have complexity on the order of O(n²), which is considerably worse for real-time operation. However, the issue with this method is that the monolithic sensing problem may not be easily separated into two different optimizations. In particular, it is expected that these tasks will be highly interdependent. Any potential solution runs the risk of over-simplifying the problem, leaving cases where the initial sensor assignment presents poor alternatives to the pose-selection process, or where the pose-selection process effectively reduces the reconfiguration resources available to the assignment process through poor long-term choices.
Any of the above solutions will also have subtle consequences for the rest of the system. To sense the actions of multiple OoIs, several areas of the current pipeline architecture would need to be expanded. By design, the pipeline structure should be able to accommodate multiple OoIs relatively easily, due to the parallel nature of many of the tasks. Feature-point sensing and the vision-based tracking methods will need to be suitably advanced to detect and track all possible OoI feature points. Many of the methods listed in Appendix D and Appendix E are capable of this; only the simplest possible methods were used in experiments, to highlight the usefulness of active vision. However, many applications involving multiple-OoI sensing will tend to have highly similar OoIs, and even more similar feature points. For example, multi-human sensing is a typical application, and low-level feature points tend to be extremely difficult for an automated system to distinguish without contextual information. As such, the vision methods used will need to be re-evaluated. The form-recovery process may need to be expanded in this context to provide improved separation of the cloud of feature points. It may also be necessary to include a method of differentiated service in the system. This may take the form of off-line or on-line adjustment of the pose-selection process; some applications may prefer high-quality action recognition for a subset of the OoIs in the system, rather than uniform, but lower, performance for all objects. Finally, many of the methods inherently assume a single, global action library. For multiple TVG OoIs, there may be differences in the set of actions, or even in the action representation, that one wishes to recognize. All pipeline stages must be designed with this consideration in mind.
Lastly, there is the consideration of the structure of the method itself. While the pipeline architecture proposed in this work is ideal for single-action sensing, it is not ensured to be the best structure for other types of objects (deformable, as mentioned above), or for multiple objects. On the surface, the parallel nature of sensing multiple objects simultaneously seems well-suited to a pipelined system. However, sensing multiple simultaneous OoIs may actually decrease the parallelism of the problem, as the interdependence between many parts of the system may increase. As with the development of the novel framework proposed in this work, it may be best to start with a
simplified sensing task and system, and iteratively re-design the proposed method to achieve the best possible structure for the problem at hand.
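As referenced above, the following is a minimal, hypothetical Python sketch of the global level of such a two-level optimization: cameras are matched to OoIs by solving a linear sum-assignment problem over an assumed suitability matrix, and the resulting assignment would then constrain the existing per-camera pose optimization. The cost model, surplus-camera policy, and function names are illustrative assumptions, not part of the proposed implementation.

```python
# Hypothetical global assignment level of a two-level optimization: cameras are
# matched to OoIs before the per-camera pose optimization runs, keeping the two
# stages sequential rather than one joint optimization over all cameras and OoIs.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_cameras(suitability):
    """suitability[i, j] -- assumed predicted usefulness of camera i for OoI j,
    e.g., combining predicted visibility, distance, and sensor capability."""
    # The Hungarian algorithm minimizes cost, so negate the suitability scores.
    cam_idx, ooi_idx = linear_sum_assignment(-suitability)
    return dict(zip(cam_idx, ooi_idx))      # camera index -> assigned OoI index

def assign_all(suitability):
    """Guarantee one camera per OoI first, then give surplus cameras to their
    best remaining OoI (an illustrative policy only)."""
    assignment = assign_cameras(suitability)
    for cam in range(suitability.shape[0]):
        if cam not in assignment:
            assignment[cam] = int(np.argmax(suitability[cam]))
    return assignment

if __name__ == "__main__":
    # Four cameras, two OoIs.
    S = np.array([[0.9, 0.2],
                  [0.4, 0.8],
                  [0.6, 0.5],
                  [0.1, 0.7]])
    print(assign_all(S))    # each camera mapped to the OoI it should serve
```

Because the assignment and the subsequent per-camera pose optimization run one after the other, this structure preserves the sequential-complexity advantage noted above, at the cost of the inter-dependence issues also discussed there.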

237 References [1] M. Bileschi, StreetScenes: Towards Scene Understanding in Still Images. Boston, U.S.: Massachusetts Institute of Technology, Ph.D. Thesis, [2] M. Johnson-Roberson, et al., "Enhanced Visual Scene Understanding through Human-Robot Dialog," in Proceedings of AAAI 2010 Fall Symposium: Dialog with Robots, Arlington, VA., [3] D. H. Ballard and C. M. Brown, Computer Vision. Englewood Cliffs, NJ., U.S.: Prentice-Hall Inc., [4] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. U.S.: Cambridge University Press, [5] O. Faugeras, Three-Dimensional Computer Vision, A Geometric Viewpoint. Boston, MA., U.S.: MIT Press, [6] M. Sonka, V. Hlavac, and R. Boyle, Image Processing, Analysis, and Machine Vision. U.S.: PWS Publishing, [7] R. Klette, K. Schluens, and A. Koschan, Computer Vision - Three-Dimensional Data from Images. Singapore: Springer, [8] J. E. Boyd and J. J. Little, "Biometric Gait Recognition," in Biometrics School 2003, Lecture Notes in Computer Science, 2005, pp [9] M. Ekinci and E. Gedikli, "Silhouette Based Human Motion Detection and Analysis for Real-Time Automated Video Surveillance," Turkish Journal of Electrical Engineering, vol. 13, no. 2, pp , [10] K. Ntalianis, A. Doulamis, N. Tsapatsoulis, and N. Doulamis, "Human Action Analysis, Annotation and Modeling in Video Streams Based on Implicit User Interaction," in AREA08, Vancouver, B.C., Canada, 2008, pp [11] K. Sakai, Y. Maeda, S. Miyoshi, and H. Hikawa, "Visual Feedback Robot System Via Fuzzy Control," in Proceedings of SICE Annual Conference, Taipei, Taiwan, 2010, pp [12] Y. Demiris and B. Khadhouri, "Content-based control of goal-directed attention during human action perception," Interaction Studies, vol. 9, no. 2, pp , [13] D. Weinland, R. Ronfard, and E. Boyer, "Free Viewpoint Action Recognition using Motion History Volumes," Computer Vision and Image Understanding, pp. 1-20, [14] X. Zhou, B. Bhanu, and J. Han, "Human Recognition at a Distance in Video by Integrating Face Profile and Gait," in Audio- and Video-Based Biometric Person Authentication, 5th International Conference, AVBPA05, Rye Town, NY., U.S., 2005, pp [15] I. Kakadiaris and D. Metaxas, "Vision-based animation of digital humans," in Conference on Computer Animation, 1998, p [16] E. J. Ong and S. Gong, "Tracking hybrid 2D-3D human models from multiple views," in International Workshop on Modeling People at ICCV 99, Corfu, Greece, September [17] N. Jojic, J. Gu, H. C. Shen, and T. Huang, "3-D Reconstruction of multipart self-occluding objects," in Asian Conference on Computer Vision, [18] G. Roth and M. D. Levine, "Geometric primitive extraction using a genetic algorithm," in Proceedings of CVPR '92., 1992 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Champaign, IL., U.S., 1992, pp [19] J. Kruger, B. Nickolay, and O. Schulz, "Image-based 3D-surveillance in man-robot-cooperation," in 2nd IEEE International Conference on Industrial Informatics, INDIN '04, Berlin, 2004, pp [20] S. Belongie, K. Branson, P. Dollár, and V. Rabaud, "Monitoring Animal Behavior in the Smart Vivarium," in Measuring Behavior, Wageningen, 2005, pp [21] E. Marchand and F. Chaumette, "Active Vision for Complete Scene Reconstruction and Exploration," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 1, pp , Jan

238 213 [22] S. D. Roy, S. Chaudhury, and S. Banerjee, "Isolated 3-D Object Recognition through Next View Planning," IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, vol. 30, no. 1, pp , Jan [23] J. E. Banta, L. R. Wong, C. Dumont, and M. A. Abidi, "A next-best-view system for autonomous 3-D object reconstruction," IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, vol. 30, no. 5, pp , [24] R. Pito, "A Solution to the Next Best View Problem for Automated Surface Acquisition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 10, pp , Oct [25] L. Campbell and A. Bobick, "Recognition of human body motion using phase space constraints," in International Conference on Computer Vision, Cambridge, MA., U.S., 1995, pp [26] L. Goncalves, E. di Bernardo, and P. Perona, "Reach Out and Touch Space (Motion Learning)," in Proceedings of the 3rd. International Conference on Face & Gesture Recognition, Nara, Japan, [27] H. de Ruiter and B. Benhabib, "On-line modeling for real-time 3D target tracking," Journal of Machine Vision and Applications, vol. 21, no. 1, pp , Oct [28] H. de Ruiter and B. Benhabib, "Object-of-Interest Selection for Model-Based 3D Pose Tracking with Background Clutter," in Novel Algorithms and Techniques in Telecommunications, Automation and Industrial Electronics, 2008, pp [29] M. Lucena, N. Perez de la Blanca, J. M. Fuertes, and M. J. Marın-Jimenez, "Human Action Recognition Using Optical Flow Accumulated Local Histograms," Pattern Recognition and Image Analysis, Lecture Notes in Computer Science, vol. 5524, pp , [30] Z. Zhang, M. Li, K. Huang, and T. Tan, "3D Model Based Vehicle Localization by Optimizing Local Gradient Based Fitness Evaluation," in Proceedings of 19th International Conference on Pattern Recognition, Tampa Bay, FL., U.S., 2008, pp [31] R. Vos and W. Brink, "Multi-view 3D position estimation of sports players," in Proceedings of the 21st Annual Symposium of the Pattern Recognition Association of South Africa, 2010, pp [32] M. Kaiser, D. Arsić, S. Sural, and G. Rigoll, "Robust tracking of facial feature points with 3D Active Shape Models," in Proceedings of 17th IEEE International Conference on Image Processing (ICIP), Hong Kong, 2010, pp [33] H. de Ruiter, "3D-tracking of A Priori Unknown Objects in Cluttered Dynamic Environments," PhD Thesis, University of Toronto, Toronto, [34] A. S. Mian, M. Bennamoun, and R. Owens, "3D model-based free-form object recognition a review," Sensor Review, vol. 25, no. 2, pp , [35] X. Pan, Y. Cao, X. Xu, Y. Lu, and Y. Zhao, "Ear and face based multimodal recognition based on KFDA," in Proceedings of International Conference on Audio, Language and Image Processing, Shanghai, 2008, pp [36] R. Zhang, P.-S. Tsai, J. E. Cryer, and M. Shah, "Shape from Shading: A Survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 8, pp , [37] P. C. Yuela, D. Q. Dai, and G. C. Feng, "Wavelet-based PCA for human face recognition," in 1998 IEEE Southwest Symposium on Image Analysis and Interpretation, Tucson, AZ., U.S., 1998, pp [38] A. A-Nasser, M. Mohammad, and A.-M. Mohamed, State of the Art in Face Recognition, 1st ed., J. Ponce and A. Karahoca, Eds. InTech, [39] M. Ahmad and S.-W. Lee, "HMM-based Human Action Recognition Using Multiview Image Sequences," in Proceedings of the 18th International Conference on Pattern Recognition (ICPR06), [40] G. 
Johansson, "Visual Perception of Biological Motion and a Model for its Analysis," Perception and Psychophysics, vol. 14, no. 2, pp , [41] R. Chellappa, A. K. Roy-Chowdhury, and S. K. Zhou, Human Activity Recognition. San Rafael, CA., USA.: Morgan & Claypool Publishing, [42] J. E. Cutting and L. T. Kozlowski, "Recognizing Friends by Their Walk: Gait Perception without Familiarity Cues," Bulletin of the Psychonomic Society, vol. 9, no. 5, pp , 1977.

239 214 [43] A. Kale, A. K. Roy-Chowdhury, and R. Chellappa, "Fusion of Gait and Face for Human Identification," in International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, 2004, pp [44] J. Gall, et al., "Motion Capture Using Joint Skeleton Tracking and Surface Estimation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR'09), [45] Y. Liu, C. Stoll, J. Gall, H. P. Seidel, and C. Theobalt, "Markerless Motion Capture of Interacting Characters Using Multi-view Image Segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR'11), 2011, pp [46] T. Urano, T. Matsui, T. Nakata, and H. Mizoguchi, "Human Pose Recognition by Memory-Based Hierarchical Feature Matching," in Proceeduings of IEEE International Conference on Systems, Man, and Cubernetics, 2004, pp [47] A. Farhadi and M. K. Tabrizi, "Learning to Recognize Activities from the Wrong View Point," in Proceedings of the 10th European Conference on Computer Vision: Part I, 2008, pp [48] Q. Shi, L. Wang, L. Cheng, and A. Smola, "Discriminative human action segmentation and recognition using semi-markov model," in IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK., U.S., 2008, pp [49] M. Dimitrijevic, V. Lepetit, and P. Fua, "Human Body Pose Recognition Using Spatio-Temporal Templates," in ICCV Workshop on Modeling People and Human Interaction, Beijing, China, 2005, pp [50] D. Cunado, M. S. Nixon, and J. Carter, "Automatic Extraction and Description of Human Gait Models for Recognition Purposes," Computer Vision and Image Understanding, vol. 90, no. 1, pp. 1-41, [51] G. V. Veres, L. Gordon, J. N. Carter, and M. S. Nixon, "What Image Information is Important in Silhouette-Based Gait Recognition?," in IEEE Conference on Computer Vision and Pattern Recognition, Washington, D.C., 2004, pp [52] X. Weimin, et al., "New Approach of Gait Recognition for Human ID," in International Conference on Signal Processing, Beijing, China, 2004, pp [53] N. Rajpoot and K. Masood, "Human Gait Recognition with 3D Wavelets and Kernel based Subspace Projections," in Workshop on Human Activity Recognition and Modeling (HAREM), Oxford, UK., [54] D. Xu, et al., "Human Gait Recognition with Matrix Representation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 7, pp , Jul [55] A. Sundaresan and R. Chellappa, "Multi-Camera Tracking of Articulated Human Motion Using Motion and Shape Cues," in Asian Conference on Computer Vision, Hyderabad, India, 2006, pp [56] I. A. Kakadiaris and D. Metaxas, "3D human body model acquisition from multiple views," in International Conference on Computer Vision, Cambridge, MA., U.S., 1995, pp [57] K. Rohr, Human Movement Analysis Based on Explicit Motion Models. Boston, U.S.: Kluwer Academic, [58] M. Isard and A. Blake, "CONDENSATION-conditional density propagation for visual tracking," International Journal of Computer Vision, pp. 5-28, [59] K. A. Tarabanis, P. K. Allen, and R. Y. Tsai, "A Survey of Sensor Planning in Computer Vision," IEEE Transactions on Robotics and Automation, vol. 11, no. 1, pp , Feb [60] J. Miura and K. Ikeuchi, "Task-Oriented Generation of Visual Sensing Strategies in Assembly Tasks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 2, pp , Feb [61] M. D. Naish, "Sensing-System Planning for the Surveillance of Moving Objects," Ph.D. Thesis, University of Toronto, Toronto, ON., Canada, [62] C. K. Cowan and P. D. 
Kovesik, "Automated Sensor Placement from Vision Task Requirements," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 10, no. 3, pp , May [63] M. K. Reed and P. K. Allen, "Constraint-Based Sesor Planning for Scene Modeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp , Dec [64] M. Mackay and B. Benhabib, "A Multi-Camera Active-Vision System for Dynamic Form Recognition," in Innovations and Advanced Techniques in Systems, Computing Sciences and Software Engineering.

240 215 Sptringer Science and Business Media B.V., 2008, pp [65] A. Mittal, "Generelized Multi-Sensor Planning," in European Conference on Computer Vision, Graz, Austria, 2006, pp [66] L. Hodge and M. Kamel, "An Agent-Based Approach to Multi-sensor Coordination," IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, vol. 33, no. 5, pp , Sep [67] S. B. Kang, P. P. Sloan, and S. M. Seitz, "Visual Tunnel Analysis for Visibility Prediction and Camera Planning," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Hilton Head Island, SC., 2000, pp [68] X. Gu, M. M. Marefat, and F. W. Ciarallo, "A Robust Approach for Sensor Placement in Automated Vision Dimensional Inspection," in IEEE International Conference on Robotics and Automation, Michigan, 1999, pp [69] S. Sakane, T. Sato, and M. Kakikura, "Model-Based Planning of Visual Sensors Using a Hand-Eye Action Simulator: HEAVEN," in Conference on Advanced Robotics, Versailles, France, Oct. 1987, pp [70] G. H. Tarbox and S. N. Gottschlich, "Planning for complete sensor coverage in inspection," Computer Vision and Understanding, vol. 61, no. 1, pp , Jan [71] D. P. Anderson, "Efficient Algorithms for Automatic Viewer Orientation," Computers and Graphics, vol. 9, no. 4, pp , [72] G. Backer, B. Mertsching, and M. Bollmann, "Data- and model-driven gaze control for an active-vision system," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 12, pp , Dec [73] F. G. Callari and F. P. Ferrie, "Active Recognition: Using Uncertainty to Reduce Ambiguity," in Proceedings of the 13th International Conference on Pattern Recognition, 1996, pp [74] S. J. Dickinson, H. I. Christensen, J. K. Tsotsos, and G. Olofsson, "Active Object Recognition Integrating Attention and Viewpoint Control," Computer Vision and Image Understanding, vol. 67, no. 3, pp , [75] E. Gonzalez, A. Adan, V. Feliu, and L. Sanchez, "A Solution to the Next Best View Problem Based on D- Spheres for 3D Object Recognition," in Conference on Computer Graphics and Imaging, Innsbruck, Austria, [76] M. Mackay and B. Benhabib, "Active-Vision System Reconfiguration for Form Recognition in the Presence of Dynamic Obstacles," in Lecture Notes on Computer Science, Conference on Articulated Motion and Deformable Objects, Andratx, Mallorca, Spain, 2008, pp [77] S. Yu, D. Tan, and T. Tan, "A Framework for Evaluating the Effect of View Angle, Clothing, and Carrying Condition on Gait Recognition," in International Conference on Pattern Recognition, Hong Kong, 2006, pp [78] M. D. Naish, E. A. Croft, and B. Benhabib, "Coordinated Dispatching of Proximity Sensors for the Surveillance of Maneuvering Targets," Journal of Robotics and Computer Integrated Manufacturing, vol. 19, no. 3, pp , [79] R. Murrieta-Cid, B. Tovar, and S. Hutchinson, "A Sampling-Based Motion Planning Approach to Maintain Visibility of Unpredictable Targets," Journal of Autonomous Robots, vol. 19, no. 3, pp , [80] B. Horling, R. Vincent, J. Shen, R. Becker, and K. Rawlings, "V. Lesser: SPT Distributed Sensor Network for Real Time Tracking," University of Massachusetts, Amherst, MA., Technical Report 00-49, [81] J. R. Spletzer and C. J. Taylor, "Dynamic Sensor Planning and Control for Optimally Tracking Targets," International Journal of Robotic Research, vol. 22, no. 1, pp. 7-20, Jan [82] M. Kamel and L. Hodge, "A Coordination Mechanism for Model-Based Multi-Sensor Planning," in IEEE International Symposium on Intelligent Control, Vancouver, Canada, 2002, pp [83] A. 
Bakhtari, M. Mackay, and B. Benhabib, "Active-Vision for the Autonomous Surveillance of Dynamic, Multi-Object Environments," Journal of Intelligent and Robotic Systems [Available On-line], Jul

241 216 [84] J. R. Spletzer and C. J. Taylor, "Sensor planning and control in a dynamic environment," in Proceedings of the IEEE International Conference on Robotics and Automation, 2002, pp [85] M. A. Otaduy and M. C. Lin, "User-Centric Viewpoint Computation for Haptic Exploration and Manipulation," in Conference on Visualization, San Diego, CA., 2001, pp [86] S. G. Goodridge, R. C. Luo, and M. G. Kay, "Multi-layered fuzzy behavior fusion for real-time control of systems with many sensors," IEEE Transactions on Industrial Electronics, vol. 43, no. 3, pp , [87] S. G. Goodridge and M. G. Kay, "Multimedia Sensor Fusion for Intelligent Camera Control," in IEEE/SICE/RSJ Multi-sensor Fusion and Integration for Intelligent Systems, Washington, D.C., 1996, pp [88] S. Stillman, R. Tanawongsuwan, and I. Essa, "A System for Tracking and Recognizing Multiple People with Multiple Cameras," in Audio and Video-Based Biometric Person Authentication (AVBPA), Washington, DC., U.S., 1999, pp [89] N. Ukita and T. Matsuyama, "Real-Time Cooperative Multi-Target Tracking by Communicating Active Vision Agents," in Proceedings of International Conference on Information Fusion, Queensland, Australia, 2003, pp [90] A. Bakhtari, "Multi-Target Surveillance in Dynamic Environments: Sensing-System Reconfiguration," Ph.D. Thesis, University of Toronto, Toronto, ON., Canada, [91] T. K. Capin, I. S. Pandzic, N. M. Thalmann, and D. Thalmann, "A Dead-Reckoning Algorithm for Virtual Human Figures," in Proceedings of the 1997 Virtual Reality Annual International Symposium (VRAIS '97), Albuquerque, NM., USA., 1997, pp [92] S. L. Dockstader and A. M. Tekalp, "Multiple camera tracking of interacting and occluded human motion," Proceedings of the IEEE, vol. 89, no. 10, pp , Oct [93] I. Haritaoglu, R. Cutler, D. Harwood, and L. S. Davis, "Backpack: Detection of people carrying objects using silhouettes," in International Conference on Computer Vision, Corfu, Greece, [94] N. Jojic, B. Brumitt, B. Meyers, S. Harris, and T. Huang, "Detection and estimation of pointing gestures in dense disparity maps," in The Fourth International Conference on Automatic Face and Gesture Recognition, Grenoble, France, [95] J. Ben-Arie, Z. Wang, P. Pandit, and S. Rajaram, "Human Activity Recognition Using Multidimensional Indexing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 8, pp , [96] J. Rittscher and A. Blake, "Classification of human body motion," in Proceedings of IEEE International Conference on Computer Vision, 1999, pp [97] Y.-R. Chen, C.-M. Huang, and L.-C. Fu, "Upper Body Tracking for Human-Machine Interaction with a Moving Camera," in IEEE/RSJ International Conference on Intelligent Robots and Systems, St. Louis, U.S., 2009, pp [98] J. B. Cole, D. B. Grimes, and R. P. N. Rao, "Learning Full-Body Motions from Monocular Vision: Dynamic Imitation in a Humanoid Robot," in Proceedings of the 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, San Diego, CA., U.S., 2007, pp [99] K. G. Derpanis, "A Review of Vision-Based Hand Gestures," Center for Vision Research, York University, Toronto, Canada, Internal Report, [100] K. G. Derpanis, R. P. Wildes, and J. K. Tsotsos, "Hand Gesture Recognition within a Linguistics-Based Framework," in Lecture Notes in Computer Science, 2004, pp [101] H. Kobayashi and F. Hara, "Facial Interaction between Animated 3D Face Robot and Human Beings," in IEEE International Conference on Computational Cybernetics and Simulation, Orlando, FL., 1997, pp [102] Y. 
Zhang and Q. Ji, "Active and Dynamic Information Fusion for Facial Expression Understanding from Image Sequences," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp , May [103] J. Huang and A. Amini, "Anatomical Object Volumes from Deformable B-spline Surface Models," in International Conference on Image Processing, 1998, pp

242 217 [104] N. Werghi, "Segmentation and Modeling of Full Human Body Shape From 3-D Scan Data: A Survey," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 37, no. 6, pp , Nov [105] T. Lindeberg and J. Garding, "Shape from texture from a multi-scale perspective," in Proceedings of Fourth International Conference on Computer Vision, Berlin, Germany, 1993, pp [106] R. Plankers and P. Fua, "Articulated soft objects for video-based body modeling," in Proceedings of the Eighth IEEE International Conference on Computer Vision, Vancouver, BC., Canada, 2001, pp [107] L. Zelnik-Manor and M. Irani, "Statistical analysis of dynamic actions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp , Sep [108] D. Anguelov, D. Koller, H.-C. Pang, P. Srinivasan, and S. Thrun, "Recovering articulated object models from 3D range data," in Proceedings of the 20th conference on Uncertainty in artificial intelligence, 2004, pp [109] E. Yu and J. K. Aggarwal, "Human Action Recognition with Extremities as Semantic Posture Representation," in International Workshop on Semantic Learning and Applications in Multimedia, [110] X. Sun, M. Chen, and A. Hauptmann, "Action recognition via local descriptors and holistic features," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Miami, FL., U.S., 2009, pp [111] C. Wu, A. H. Khalili, and H. Aghajan, "Multiview activity recognition in smart homes with spatiotemporal features," in Proceedings of the Fourth ACM/IEEE International Conference on Distributed Smart Cameras, 2010, 2010, pp [112] A. Y. Mubarak, "Actions Sketch: A Novel Action Representation," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, San Diego, CA., U.S., 2005, pp [113] X. Ji and H. Liu, "Advances in View-Invariant Human Motion Analysis: A Review," IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reciews, vol. 40, no. 1, pp , [114] B. Fevery, B. Wyns, L. Boullart, J. R. L. García, and C. T. Ferrero, "Industrial robot manipulator guarding using artificial vision," in Robot Vision, U. Ales, Ed. Vukovar, Croatia: In-Tech, 2010, pp [115] N. Ahuja and E. Resendiz, "A unified model for activity recognition from video sequences," in 19th International Conference on Pattern Recognition, Tampa, FL, 2008, pp [116] A. C. Sankaranarayanan, A. Veeraraghavan, and R. Chellappa, "Object Detection, Tracking and Recognition for Multiple Smart Cameras," Proceedings of the IEEE, vol. 96, no. 10, pp , Oct [117] H. J. Seo and P. Milanfar, "Training-Free, Generic Object Detection Using Locally Adaptive Regression Kernels," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp , Sep [118] E. Loutas, I. Pitas, and C. Nikou, "Entropy-based metrics for the analysis of partial and total occlusion in video object tracking," IEEE Proceedings of Visual Image Signal Processing, vol. 151, no. 6, pp , Dec [119] E. Loutas, K. Diamantaras, and I. Pitas, "Occlusion resistant object tracking," in Proceedings of the 2001 International Conference on Image Processing, Thessaloniki, Greece, 2001, pp [120] C. H. Messom, "Synchronisation of vision-based sensor networks with variable frame rates," International Journal of Computer Applications in Technology, vol. 39, no. 1, Aug [121] F. Fleuret, J. Berclaz, R. Lengagne, and P. 
Fua, "Multicamera People Tracking with a Probabilistic Occupancy Map," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp , Feb [122] C. C. Chibelushi and F. Bourel, "Hierarchical multi-stream recognition of facial expressions," in International Conference on Visual Information Engineering, 2003, pp [123] J. Kilner, J..-Y. Guillemaut, and A. Hilton, "3D action matching with key-pose detection," in 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), Kyoto, Japan, 2009, pp. 1-8.

243 218 [124] L. Chunli, W. Kejun, and X. Yu, "An Action Classification Algorithm Based on MEI and LPP," in 2010 International Conference on Electrical and Control Engineering (ICECE), Wuhan, 2010, pp [125] R. Poppe, "A survey on vision-based human action recognition," Image and Vision Computing, vol. 28, no. 6, pp , Jun [126] A. Bakhtari and B. Benhabib, "An Active Vision System for Multi-Target Surveillance in Dynamic Environments," IEEE Transactions on Systems, Man, and Cybernetics, vol. 37, no. 1, pp , [127] D. Ayers and M. Shah, "Monitoring human behavior from video taken in an office environment," Image and Vision Computing, vol. 19, pp , [128] B. Chakraborty, A. D. Bagdanov, and J. Gonzalez, "Towards Real-Time Human Action Recognition," Lecture Notes in Computer Science: Pattern Recognition and Image Analysis, vol. 5524, pp , [129] P. M. Yanik, et al., "Toward active sensor placement for activity recognition," in Proceeding of 10th WSEAS international conference, [130] K. Kemmotsu and T. Kanade, "Sensor Placement Design for Object Pose Determination with Three Light-Stripe Range Finders," in Proceedings of 1994 IEEE International. Conference on Robotics and Automation, 1994, pp [131] R. Bodor, P. Schrater, and N. Papanikolopoulos, "Multi-Camera Positioning to Optimize Task Observability," in Proceedings of IEEE Conference on Advanced Video and Signal Based Surveillance, 2005, pp [132] J. Biddiscombe, B. Geveci, K. Martin, K. Moreland, and D. Thompson, "Time Dependent Processing in a Parallel Pipeline Architecture," IEEE Transactions on Visualization and Computer Graphics, vol. 13, no. 6, pp , Dec [133] S. A. Williams, Programming models For Parallel Systems. New York, NY, USA: John Wiley & Sons, [134] G. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," in AFIPS Conference Proceedings, 1967, pp [135] J. L. Gustafson, "Reevaluating Amdahl's Law," Communications of the ACM, vol. 31, no. 5, pp , [136] A. E. Salama, A. K. Ali, and E. A. Talkhan, "Functional testing of pipelined processors," IEE Computers and Digital Techniques, vol. 143, no. 5, pp , [137] M. Mackay, B. Benhabib, and R. Fenton, "Active Vision for Human Action Sensing," in Proc. of CISSE 2009, [Online], 2009, p. [InPrint]. [138] Z. Pan, A. G. Rust, and H. Bolouri, "Image redundancy reduction for neural network classification using discrete cosine transforms," in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, Jul. 2000, pp [139] D. Ruprech and H. Müller, "Image Warping with Scattered Data Interpolation," IEEE Computer Graphics and Applications, vol. 15, no. 2, pp , Mar [140] D. C. Brown, "Decentering Distortion of Lenses," Photometric Engineering, vol. 32, no. 3, pp , [141] Z. Zhang, "A Flexible New Technique for Camera Calibration," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, pp , [142] J. Heikkila and O. Silven, "A Four-step Camera Calibration Procedure with Implicit Image Correction," in IEEE Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, 1997, pp [143] W. M. Lam and A. R. Reibman, "Self-synchronizing variable-length codes for image transmission," in IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992, pp [144] Y. Tang and J. Shin, "De-ghosting for Image Stitching with Automatic Content-Awareness," in th International Conference on Pattern Recognition (ICPR), 2010, pp [145] K. Liu, Q. Du, H. Yang, and B. 
Ma, "Optical Flow and Principal Component Analysis-Based Motion Detection in Outdoor Videos," EURASIP Journal on Advances in Signal Processing, vol. 2010, pp. 1-6,

244 [146] H. Stewénius, F. Schaffalitzky, and D. Nistér, "How Hard is 3-View Triangulation Really?," in Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, Beijing, China, 2005, pp [147] J. D. Kim and S. K. Mitra, "A local relaxation method for optical flow estimation," Signal Processing: Image Communication, vol. 11, no. 1, pp , Nov [148] G. A. Terejanu, "Extended and Unscented Kalman Filter Tutorial," University of Buffalo Tutorial Document, [149] M. St-Pierre and D. Gingras, "Comparison between the unscented Kalman filter and the extended Kalman filter for the position estimation module of an integrated navigation information system," in 2004 IEEE Intelligent Vehicles Symposium, 2004, pp [150] I.-C. Chang and S.-Y. Lin, "3D human motion tracking based on a progressive particle filter," Pattern Recognition, vol. 43, no. 10, pp , Oct [151] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-gaussian Bayesian tracking," IEEE Transactions on Signal Processing, vol. 50, no. 2, pp , Feb [152] M. D. Himmelblau, Applied Non-Linear Programming. McGraw Hill, [153] M. Raptis, K. Wnuk, and S. Soatto, "Flexible Dictionaries for Action Classification," in Proceedings of the International Workshop on Machine Learning for Vision-based Motion Analysis (MLVMA08), [154] J. Yang, Y.-G. Jiang, A. G. Hauptmann, and C.-W. Ngo, "Evaluating bag-of-visual-words representations in scene classification," in Proceedings of the international workshop on Workshop on multimedia information retrieval, [155] M. T. A. Lozano, M. Devy, J. Miguel, and S. Marti, "Perception planning for an exploration task of a 3D environment," in Proceedings of the 16th International Conference on Pattern Recognition, 2002, pp [156] M. Mackay, R. G. Fenton, and B. Benhabib, "Time-Varying-Geometry Object Surveillance Using a Multi-Camera Active-Vision System," International. Journal on Smart Sensing and Intelligent Systems, vol. 1, no. 3, pp , Sep [157] J. Shade, S. Gortler, L.-w. He, and R. Szeliski, "Layered Depth Images," in Proceedings of the 25th annual conference on Computer graphics and interactive techniques, [158] Z. Zhang, "Parameter Estimation Techniques: A Tutorial with Application to Conic Fitting," Image and Vision Computing Journal, vol. 15, no. 1, pp , [159] B. D. Kanade and T. Kanade, "An Iterative Image Registration Technique with an Application to Stereo Vision," in Proceedings of Imaging Understanding Workshop, 1981, pp [160] K. Ramath, "On the Multi-View Fitting and Construction of Dense Deformable Face Models," M.A.Sc. Thesis, Carnegie Mellon University, Pittsburgh, PA., [161] L. de Agapito, E. Hayman, and I. Reid, "Self-Calibration of Rotating and Zooming Cameras," Intational Journal of Computer Vision, vol. 45, no. 2, pp , [162] K. R. S. Kodagoda, A. Alempijevic, J. Underwood, S. Kumar, and G. Dissanayake, "Sensor Registration and Calibration using Moving Targets," in 9th International Conference on Control, Automation, Robotics and Vision, Singapore, 2006, pp [163] F. Lange and G. Hirzinger, "Calibration and Synchronization of a Robot-Mounted Camera for Fast Sensor-Based Robot Motion," in IEEE International Conference on Robotics and Automation (ICRA2005), Barcelona, Spain, 2005, pp [164] L. Montesano, J. Minguez, and L. Montano, "Modeling dynamic scenarios for local sensor-based motion planning," Autonomous Robots, vol. 25, no. 3, pp , Oct [165] F. Pagel and D. 
Willersinn, "Motion-based online calibration for non-overlapping camera views," in th International IEEE Conference on Intelligent Transportation Systems (ITSC), 2010, pp [166] N. Roy and S. Thrun, "Online self-calibration for mobile robots," in IEEE International Conference on Robotics and Automation, Detroit, MI, USA, 1999, pp

245 220 [167] R. Y. Tsai, "A versatile camera calibration technique for high accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses," IEEE Journal of Robotics and Automation, vol. RA-3, no. 4, pp , [168] J. Mulligan, "Empirical Modeling and Comparison of Robotic Tasks," in Proceedings of the 1998 IEEE/RSJ International Conference on Intelligent Robots and Systems, Victoria, BC., Canada, 1998, pp [169] V. Areekul, U. Watchareeruetai, and S. Tantaratana, "Fast Separable Gabor Filter for Fingerprint Enhancement," Lecture Notes in Computer Science, vol. 3072/2004, pp , [170] B. Jähne, H. Scharr, and S. Körkel, Handbook of Computer Vision and Applications, 1st ed., B. Jahne, H. Haussecker, and P. Geissler, Eds. Academic Press, [171] R. Kountchev, V. Todorov, M. Milanova, and R. Kountcheva, "Documents Image Compression with IDP and Adaptive RLE," in IECON nd Annual Conference on IEEE Industrial Electronics, Paris, France, 2006, pp [172] D. J. Fleet and Y. Weiss, "Optical flow estimation," in Mathematical models for Computer Vision: The Handbook, N. Paragios, Y. Chen, and O. Faugeras, Eds. Canada: Springer, [173] B. Song, et al., "Tracking and Activity Recognition Through Consensus in Distributed Camera Networks," IEEE Transactions on Image Processing, vol. 19, no. 10, pp , Oct [174] R. W. Beard, T. W. McLain, M. A. Goodrich, and E. P. Anderson, "Coordinated target assignment and intercept for unmanned air vehicles," IEEE Transactions on Robotics and Automation, vol. 18, no. 6, pp , Dec [175] S.-C. Pei, W.-S. Lu, and C.-C. Tseng, "Analytical Two-Dimensional IIR Notch Filter Design Using Outer Product Expansion," IEEE Transactions on Circuits and Systems - II: Analog and Digital Signal Processing, vol. 44, no. 9, pp , Sep [176] B. Simak, M. Vlcek, and P. Zahradnik, "Analytical Design of Bandstop FIR Filters for Image Processing," in IEEE Circuits and Systems Society, ISCCSP 2006 Proceedings, [177] D. M. Weber and D. P. Casasent, "Quadratic Gabor Filters for Object Detection," IEEE Transactions on Image Processing, vol. 10, no. 2, pp , Feb [178] N. M. Kwok, Q. P. Ha, G. Fang, A. Rad, and D. Wang, "Color Image Contrast Enhancement Using a Local Equalization and Weighted Sum Approach," in The 6th annual IEEE Conference on Automation Science and Engineering (CASE2010), Toronto, ON., Canada, [179] S.-C. Hsu, S.-F. Liang, K.-W. Fan, and C.-T. Lin, "A Robust In-Car Digital Image Stabilization Technique," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 37, no. 2, pp , Mar [180] C. F. Graetzel, B. J. Nelson, and S. N. Fry, "A Dynamic Region-of-Interest Vision Tracking System Applied to the Real-Time Wing Kinematic Analysis of Tethered Drosophila," IEEE Transactions on Automation Science and Engineering, vol. 7, no. 3, pp , Jul [181] M. U. Akram, S. Nasir, A. Tariq, I. Zafar, and W. S. Khan, "Improved fingerprint image segmentation using new modified gradient based technique," in Canadian Conference on Electrical and Computer Engineering, CCECE 2008, Niagara Falls, ON., Canada, [182] X. Guan, S. Jian, P. Hongda, Z. Zhiguo, and G. Haibin, "A Novel Corner Point Detector for Calibration Target Images Based on Grayscale Symmetry," in Second International Symposium on Computational Intelligence and Design, 2009, ISCID '09, Changsha, 2009, pp [183] A. Bovik, Handbook of Image and Video Processing, Second Edition ed., A. Bovik, Ed. Elsevier Inc., [184] R. L. de Queiroz, Z. Fan, and T. D. 
Tran, "Optimizing block-thresholding segmentation for multilayer compression of compound images," IEEE Transactions on Image Processing, vol. 9, no. 9, pp , Sep [185] I. E. Sampe, N. Amar Vijai, R. M. Tati Latifah, and T. Apriantono, "A study on the effects of lightning and marker color variation to marker detection and tracking accuracy in gait analysis system," in International Conference on nstrumentation, Communications, Information Technology, and Biomedical Engineering (ICICI-BME), Bandung, 2009, pp. 1-5.

246 221 [186] C.-H. Chuang, J.-W. Hsieh, L.-W. Tsa, P.-S. Ju, and K.-C. Fan, "Suspicious object detection using fuzzycolor histogram," in IEEE International Symposium on Circuits and Systems, Seattle, WA., USA, 2008, pp [187] H. Ukida, S. Kaji, Y. Tanimoto, and H. Yamamoto, "Human Motion Capture System Using Color Markers and Silhouette," in IMTC Instrumentation and Measurement Technology Conference, Sorrento, Italy, 2006, pp [188] H. Trichili, M..-S. Bouhlel, N. Derbel, and L. Kamoun, "A survey and evaluation of edge detection operators application to medical images," in IEEE International Conference on Systems, Man and Cybernetics, 2002, pp [189] J. Liu, A. Jakas, A. Al-Obaidi, and Y. Liu, "A comparative study of different corner detection methods," in IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA), Daejeon, 2009, pp [190] A. Mueen, R. Zainuddin, and S. M. Baba, "Computational Intelligence in Robotics and Automation (CIRA)," Journal of Digital Imaging, vol. 21, no. 3, pp , Sep [191] L. Wu, Y. Wang, and Y. Liu, "Multiple targets tracking with Robust PCA-based background subtraction and Mean-shift driven particle filter," in International Conference on Computer, Mechatronics, Control and Electronic Engineering (CMCE), Changchun, 2010, pp [192] T. H. Heibel, B. Glocker, N. Paragios, and N. Navab, "Needle tracking through higher-order MRF optimization," in IEEE International Symposium on Biomedical Imaging: From Nano to Macro, Rotterdam, 2010, pp [193] T. M. Chin, W. C. Karl, and A. S. Willsky, "Probabilistic and sequential computation of optical flow using temporal coherence," IEEE Transactions on Image Processing, vol. 3, no. 6, pp , Nov [194] A. Borkar, M. Hayes, and M. T. Smith, "Robust lane detection and tracking with ransac and Kalman filter," in 16th IEEE International Conference on Image Processing (ICIP), Cairo, 2009, pp [195] C. Wang, J.-H. Kim, K.-Y. Byun, J. Ni, and S.-J. Ko, "Robust Digital Image Stabilization Using the Kalman Filter," IEEE Transactions on Consumer Electronics, vol. 55, no. 1, pp. 6-14, Feb [196] S. Y. Chen and Y. F. Li, "Vision Sensor Planning for 3-D Model Acquisition," IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, vol. 30, no. 5, pp , Sep

Appendix A
Common Pre-Processing Filters and Implementations
There are numerous types of pre-processing filters which can be applied to remove noise and other imperfections from images. Each of these filters, in turn, may have multiple potential implementations. In general, these filters may have a significant impact on sensing-task performance when the input image is even mildly corrupted. As such, their selection is essential to any method, including the one proposed in this work. Furthermore, for this method, it is essential to select implementations which work well within the pipeline structure chosen for the system. As such, more attention is given to the selection of these filters, although the final choice must still be made by the system designer. The following is a brief summary of the most common filter types and their applications.

Low-Pass Filter (LPF)
Images are commonly corrupted by random or quasi-random pixel-level noise. This noise may come from multiple sources. The quantization process itself introduces quantization errors, especially along edges where there is a significant depth change in the environment. Aliasing of small features is related to this phenomenon. Noise is also introduced through random measurement error during image capture. These sources are often modeled as completely random, per-pixel noise for simplicity, and low-pass filters are used to reduce the noise magnitude. If the spectral distribution of the noise is assumed to be white, an appropriate LPF can be selected to remove much of the noise without affecting useful image data. The reference implementation chosen for the LPF is the separable Gaussian LPF, which is based on the Gaussian function [169]:

G(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-x^2/(2\sigma^2)},   (A.1)

G(x, y) = G(x)\,G(y) = \frac{1}{2\pi\sigma^2} \, e^{-(x^2 + y^2)/(2\sigma^2)}.   (A.2)

Equation (A.1) gives the Gaussian function for one dimension, which is characterized by its standard deviation, σ. The 2-D function in Equation (A.2), G(x, y), is assumed to have equal variance in both directions, and is simply the product of two 1-D Gaussian functions. If used as the basis for an LPF, several issues arise. First, a 2-D Gaussian kernel is non-zero at all pixels in the image, greatly increasing the number of computations. Secondly, the kernel should be separable, to further reduce the number of image elements which must be accessed with each pixel calculation.

The first issue is addressed by using a windowed Gaussian kernel of size 6σ × 6σ. This choice of kernel size is typical for image processing, as the result obtained will not differ significantly from using the entire distribution. The kernel is also easily separable in the X and Y directions, yielding two 1D kernels:

K_X(i) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-i^2 / (2\sigma^2)}, \quad i = -\tfrac{6\sigma - 1}{2}, \ldots, \tfrac{6\sigma - 1}{2},   (A.3)

K_Y(j) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-j^2 / (2\sigma^2)}, \quad j = -\tfrac{6\sigma - 1}{2}, \ldots, \tfrac{6\sigma - 1}{2}.   (A.4)

Equations (A.3) and (A.4) give the kernel elements of the Gaussian LPF, where K_X and K_Y are the separable Gaussian LPF kernels for the X and Y directions, respectively. It is assumed that 6σ is odd, to allow proper (centered) indices to be selected. Figure A.1 below demonstrates the use of this LPF for various values of the standard deviation.

FIGURE A.1 EFFECT OF GAUSSIAN LPF AT VARIOUS STRENGTHS

The use of a LPF is best suited to systems which use image-based comparison methods, such as color-based feature analysis or Principal Component Analysis (PCA). These methods directly compare image blocks, and removing noise from these blocks may potentially improve matching performance. LPFs are not well suited to edge- or gradient-based methods, such as Optical Flow (OF). Information is inherently lost when applying a LPF; ideally, the majority of this information lies in the spectrum of the noise, and not in useful image data. However, a strong LPF will tend to reduce large image gradients and blur edges, which is exactly the information that methods such as OF use. These methods already have internal strategies to deal with outliers, so modifying their input in this manner may cause unintended consequences. In most applications, the system designer must simply strike a balance when using this filter.
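The two-pass, windowed filtering described above can be sketched as follows. This is an illustrative NumPy implementation only, not the system's actual filter code; the function names, the choice of a window roughly 6σ wide rounded to an odd length, and the normalization of the kernel (so that overall image brightness is preserved) are assumptions of this sketch:

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Windowed 1-D Gaussian kernel in the spirit of Equations (A.3)/(A.4)."""
    half = int(np.ceil(3.0 * sigma))            # total width ~6*sigma, odd length
    i = np.arange(-half, half + 1, dtype=float)
    k = np.exp(-i**2 / (2.0 * sigma**2))
    return k / k.sum()                          # normalize to preserve brightness

def separable_gaussian_lpf(image, sigma):
    """Apply the LPF as two 1-D passes (rows, then columns) instead of one 2-D convolution."""
    k = gaussian_kernel_1d(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

noisy = np.random.rand(480, 640)                # stand-in for a corrupted grey-scale frame
smoothed = separable_gaussian_lpf(noisy, sigma=1.5)
```

The two 1-D passes access only about 2(6σ) elements per pixel rather than (6σ)², which is the computational saving that motivates the separable form.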

Band-Stop Filter

A band-stop filter is similar in application to a LPF. Given an image corrupted by noise confined to a specific, known frequency band, this filter can be applied to selectively remove the noise while preserving image data at higher and lower frequencies. Noise with a highly selective frequency band often originates from artificial sources: it may be an artifact of the sampling process (such as in older CCD cameras), mechanical vibration of the sensor or of other parts of the environment, or even a specific object surface pattern. As such, a band-stop filter typically rejects a narrow band of frequencies, tuned to the specific noise being removed. Two reference implementations were chosen, to be used depending on the nature of the noise to be removed: (i) a 2-D IIR notch filter [175], and (ii) 2-D FIR narrow band-stop filters [176].

For Option (i), a method based on outer-product expansion, such as [175], can be used to reduce the 2-D notch-filter design to two pairs of 1-D filters. The approach proposed in [175] yields a closed-form transfer function which satisfies bounded-input/bounded-output (BIBO) stability. Similarly, a method such as [176] can be used to implement Option (ii); this method also yields an inherently stable filter with a closed-form transfer function. The choice of infinite impulse response (IIR) versus finite impulse response (FIR) is mainly left to the designer; the issues of concern are the use of feedback versus ease of design. For example, the IIR reference method above is more flexible in its band selection but harder to implement, while the FIR method is simple to design and implement but mainly limited to symmetrical, narrow band-stop regions. Both of the above filter implementations also require the image to be converted to the frequency domain before the filter is applied, increasing the computational cost of filter operations. It should be noted that, for the purpose of this framework, almost any desired filter implementation can be used, so long as it satisfies real-time processing constraints and is stable. Sample images (taken directly from [175] and [176]) for the two reference implementations above are shown in Figure A.2 below to illustrate their usage.

FIGURE A.2 EXAMPLE APPLICATIONS FOR NOTCH FILTER REFERENCE IMPLEMENTATIONS
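For illustration only, the sketch below applies a generic frequency-domain notch by zeroing a small disc around the offending frequency and its mirror. It is not the closed-form IIR design of [175] or the FIR design of [176]; the function name and parameters are assumptions of this sketch:

```python
import numpy as np

def fft_notch_filter(image, centre_uv, radius):
    """Suppress a narrow, symmetric frequency band around +/- centre_uv.

    Generic frequency-domain notch; the stable closed-form designs of
    [175]/[176] are not reproduced here.
    """
    rows, cols = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Mask that zeroes a small disc around the noise frequency and its mirror.
    u = np.arange(rows)[:, None] - rows // 2
    v = np.arange(cols)[None, :] - cols // 2
    mask = np.ones((rows, cols))
    for cu, cv in (centre_uv, (-centre_uv[0], -centre_uv[1])):
        mask[(u - cu) ** 2 + (v - cv) ** 2 <= radius ** 2] = 0.0

    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return np.real(filtered)

# Example: remove a periodic interference pattern at roughly 40 cycles vertically.
frame = np.random.rand(480, 640)
clean = fft_notch_filter(frame, centre_uv=(40, 0), radius=3)
```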

A band-stop filter is also generally applicable to all images and systems. However, if the band of the noise falls within the useful frequency spectrum of the image, information will inherently be lost, as for a LPF. It may still be useful to remove this noise along with the useful information, especially if the combined noise and data cause significant outliers or estimation error. For many algorithms, a small amount of useful information may produce more desirable operation than highly corrupted useful information, although both are likely to be worst-case scenarios. The above reference implementations are not the only filter archetypes which may be used; for example, a Gabor filter can be even more selective, removing only a specific texture and orientation in the image (e.g., [177]). In general, the system designer must carefully examine their target environment and sensing system to select one or more appropriate band-stop filters, while keeping in mind that their application is computationally costly compared to many of the other filters examined here.

Brightness and Contrast Normalization

The most basic function of brightness and contrast normalization is to adjust the raw pixel-intensity values of an image such that the overall (average) level of brightness is kept constant over time, and such that the overall level of contrast spans a fixed range as well. In essence, these methods fix the mean pixel intensity to a specific value and adjust the distribution of the image's pixel intensities around this mean to achieve a specified range. For a single pixel intensity, I(x, y), with image coordinates (x, y), the most basic, generic normalization process is given by the following equations:

\mathrm{clamp}(v, l, u) = \min\big(\max(v, l),\, u\big),   (A.5)

I'(x, y) = \mathrm{clamp}\!\left( \big(I(x, y) - \bar{I}\big)\,\frac{r}{d_{\max} - d_{\min}} + b,\; I_{\min},\; I_{\max} \right).   (A.6)

Equation (A.5) is a clamping function, which fixes the value of v such that l ≤ v ≤ u. The second equation, Equation (A.6), uses this clamping function to fix the output of the normalization process to lie between the lower and upper limits of the data representation, I_min and I_max, respectively. The input intensity, I(x, y), has an average intensity value, Ī. The values d_min and d_max are the selected minimum and maximum values of the image's distribution, and r is the new desired range. Finally, the value of b sets the new image brightness.

This simple method of normalization works best for images with only long-term variation in brightness (i.e., no significant changes in brightness between two closely timed images) and approximately Gaussian-distributed intensities.
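A minimal sketch of this global normalization follows, assuming the mean-centred rescale-and-clamp form given in Equations (A.5) and (A.6). The parameter names, the NumPy setting, and the illustrative values are assumptions of this sketch; per-channel handling and the localized variant discussed next are omitted:

```python
import numpy as np

def clamp(v, lower, upper):
    """Clamping function of Equation (A.5): fixes v so that lower <= v <= upper."""
    return np.minimum(np.maximum(v, lower), upper)

def normalize_brightness_contrast(image, d_min, d_max, new_range, brightness,
                                  out_min=0.0, out_max=255.0):
    """Global normalization in the spirit of Equation (A.6): re-centre about the
    mean, rescale the chosen intensity spread to new_range, shift to the target
    brightness, and clamp to the limits of the data representation."""
    mean = image.mean()
    scaled = (image - mean) * (new_range / float(d_max - d_min)) + brightness
    return clamp(scaled, out_min, out_max)

# Example: map the central intensity spread of an 8-bit frame onto a 200-level
# range centred at mid-grey. The parameter values here are purely illustrative.
frame = np.random.randint(0, 256, size=(480, 640)).astype(float)
norm = normalize_brightness_contrast(frame,
                                     d_min=np.percentile(frame, 1),
                                     d_max=np.percentile(frame, 99),
                                     new_range=200.0, brightness=128.0)
```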

However, there are numerous other issues which complicate this process. Images may vary in brightness significantly over time, which means that the selection of parameters, most importantly Ī, d_min, and d_max, must be continually updated. In many cases, it is beneficial to calculate these values directly and uniquely for each image. Similarly, it may be necessary to maintain separate parameters for each color channel, as their distributions are not typically identical; otherwise, color imbalance may occur. Automated camera controls, if they cannot be disabled in hardware, may also interfere with this process. In addition, images may themselves vary in brightness and contrast from one area to another. Shadows, specular highlights, and general environmental lighting variation mean that different areas of the source image will require different correction factors. As such, a general-purpose method, such as [178], is recommended, which first segments the image into multiple regions (in this case, through quasi-random segmentation) and then performs normalization in localized areas, rather than on the whole image. Obviously, this is not a perfect solution: the algorithm has no way of knowing the source of a localized variation in brightness or contrast, so it may artificially introduce artifacts around shadows and highlights. Such methods are also inherently limited in their ability to deal with color-saturation issues caused by poor quantization. As such, the best solution is for the system designer to again use the combination of normalization processes which best suits the particular difficulties encountered in their sensing environment. One must also keep in mind that images which intuitively look good to a human observer may not be performance-optimal for the chosen vision methods, as human perception of brightness is highly non-linear and has significantly higher dynamic range than most image sensors. An example of the results of the process used in the later experiments of Chapter 6 is shown in Figure A.3 below.

FIGURE A.3 BRIGHTNESS AND CONTRAST NORMALIZATION REFERENCE METHOD EXAMPLE

Image Stabilization

Images may also contain pixel motion which is not directly controlled through sensor motion. This may be a result of vibration induced through motion stages or other machinery in the environment,
