Modeling pigeon behaviour using a Conditional Restricted Boltzmann Machine

Matthew D. Zeiler (1), Graham W. Taylor (1), Nikolaus F. Troje (2) and Geoffrey E. Hinton (1)
1 - University of Toronto - Dept. of Computer Science, Toronto, Ontario M5S 2Z9 - Canada
2 - Queen's University - Dept. of Psychology, Kingston, Ontario K7L 3N6 - Canada

Abstract. In an effort to better understand the complex courtship behaviour of pigeons, we have built a model learned from motion capture data. We employ a Conditional Restricted Boltzmann Machine (CRBM) with binary latent features and real-valued visible units. The units are conditioned on information from previous time steps to capture dynamics. We validate a trained model by quantifying the characteristic head-bobbing present in pigeons. We also show how to predict missing data by marginalizing out the hidden variables and minimizing free energy.

1 Introduction

Recent studies investigating the complex courtship behaviour of the pigeon, Columba livia, demonstrate that pigeons show courtship responses not only to real partners, but also to video [1]. More recently, social behaviour in pigeons has been elicited by a virtual pigeon, driven by motion capture (mocap) data gathered from a real pigeon and rendered through a computer graphics engine [2]. Investigating avian social perception is an interesting problem, as it presents a sandbox in which we can experiment to better understand interactive social behaviour. A natural next step is to drive a virtual pigeon not by mocap data, but by a model learned from such data. This will give researchers both control over and insight into the complex factors underlying courtship behaviour.

The success of a temporal extension of Restricted Boltzmann Machines in modeling human motion [3] has prompted us to consider this powerful class of models for learning from mocap data captured from both single pigeons and pairs of pigeons in courtship. These models can learn complex dynamics directly from data, without imposing physics-based constraints. Their generative nature permits online synthesis of novel motion using only the learned weights and a few valid frames for initialization. Like a Hidden Markov Model (HMM), they can capture nonlinearities in the observations, but their distributed binary hidden state is exponentially more powerful than the K-state multinomial used by an HMM. This allows the CRBM to capture longer-term dependencies.

In this paper, we concentrate on learning a generative model of the motion of a single pigeon. Before modeling courtship behaviour, we must first confirm that our model is capable of capturing the subtleties of pigeon motion, notably the characteristic head-bobbing [4]. We also focus on the more practical problem of predicting the location of the feet, which are frequently occluded during capture.

2 The Conditional Restricted Boltzmann Machine

The CRBM is a non-linear generative model for time-series data: an undirected model with binary latent variables, h, connected to a collection of visible variables, v. The visible variables can use any distribution in the exponential family; we use real-valued Gaussian units in our experiments. At each time step, v and h receive directed connections from the visible variables at the last few time steps. The model defines a joint probability distribution over v and h, conditional on the past N observations and the model parameters, θ:

p(v, h \mid \{v\}_{t-N}^{t-1}, \theta) = \exp\bigl(-E(v, h \mid \{v\}_{t-N}^{t-1}, \theta)\bigr)/Z,

E(v, h \mid \{v\}_{t-N}^{t-1}, \theta) = \sum_i \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_j h_j b_j - \sum_{i,j} \frac{v_i}{\sigma_i} w_{ij} h_j,   (1)

where Z is a constant called the partition function, which is exponentially expensive to compute exactly. The dynamic biases, b_i and b_j, are affine functions of the past N observations. Such an architecture makes on-line inference efficient and allows us to train by minimizing contrastive divergence (for details, see [5]).

Taylor et al. [3] showed that after training a CRBM on mocap data, a single model could synthesize novel motion in various styles without the need to keep a database of valid motions. The model was also used to perform on-line filling-in of data lost during motion capture.

An important feature of the CRBM is that once it is trained, we can add layers as in a Deep Belief Network [6]. The previous-layer CRBM is kept, and its sequence of hidden state vectors, while driven by the data, is treated as a new kind of fully observed data. The next-level CRBM has the same architecture as the first (though it has binary visible units, and we can change the number of hidden units) and is trained in exactly the same way (Fig. 1). Upper levels of the network can then model higher-order structure. More layers aid in capturing multiple styles of motion and permit transitions between these styles [3].
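To make the training procedure concrete, here is a minimal numpy sketch of one contrastive-divergence (CD-1) update for a Gaussian-binary CRBM, assuming σ_i = 1 as in the experiments below. The array names, toy learning rates, and the omission of momentum are our own simplifications, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_t, v_hist, W, A, B, b_v, b_h, lr=1e-3, lr_ar=1e-5):
    """One CD-1 step for a Gaussian-binary CRBM (sigma_i fixed to 1).

    v_t:    (n_vis,) current frame
    v_hist: (n_vis * N,) concatenated past N frames
    W:      (n_vis, n_hid) visible-to-hidden weights
    A:      (n_vis * N, n_vis) autoregressive (past-to-visible) weights
    B:      (n_vis * N, n_hid) past-to-hidden weights
    """
    # Dynamic biases: affine functions of the past N observations (Eq. 1).
    bv_dyn = b_v + v_hist @ A
    bh_dyn = b_h + v_hist @ B

    # Positive phase: infer the hiddens from the data.
    h_prob = sigmoid(v_t @ W + bh_dyn)
    h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)

    # Negative phase: one step of alternating Gibbs sampling.
    v_recon = h_samp @ W.T + bv_dyn              # mean of the Gaussian visibles
    h_recon = sigmoid(v_recon @ W + bh_dyn)

    # Approximate likelihood gradients (contrastive divergence).
    W += lr * (np.outer(v_t, h_prob) - np.outer(v_recon, h_recon))
    b_v += lr * (v_t - v_recon)
    b_h += lr * (h_prob - h_recon)
    A += lr_ar * np.outer(v_hist, v_t - v_recon)   # smaller rate for AR weights
    B += lr_ar * np.outer(v_hist, h_prob - h_recon)
```

The dynamic biases bv_dyn and bh_dyn carry the directed connections from the past N frames; everything else is a standard RBM update.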

3 Data gathering and preprocessing

Markers were placed at various locations on the head, torso, and both feet of a pigeon to capture the local movements of each body segment. Each pigeon was allowed to walk in an enclosed area and was recorded with an array of synchronized cameras. The collected data was cleaned to account for sensor noise and occlusion, providing (x, y, z) positions of each marker in mm with respect to a global coordinate system. This data was converted to a hierarchy of coordinate systems relative to a body coordinate frame. The origin of the local coordinate system for each foot and for the head was defined as a translation and rotation relative to the body frame, which was in turn relative to the global coordinate system. All rotations were converted to an exponential-map representation. As in [3], the root segment was expressed in a body-centred coordinate system which is invariant to ground-plane translations and to rotations about the gravitational vertical. Finally, all data was scaled to have zero mean and unit variance (data is rescaled before measurement or playback). The final representation was 24 real values per frame: 6 degrees of freedom (dof) for each of 4 segments. We included all translational dof to account for the articulated, multi-segment nature of the neck and legs.

Fig. 1: Architecture of a two-layer CRBM (visible units i, first-layer hidden units j, second-layer hidden units k; directed connections from frames t-2 and t-1 to frame t).

Fig. 2: The hold phase present in generated motion, plotted as head position (mm) in the global x, y, z coordinate system. RGB points mark each frame's xyz axes.

4 Experimental setup and discussion

4.1 Generation of novel motion

Following the procedure of [3], we trained a Gaussian-binary CRBM on 11583 frames of pigeon mocap data at 120 fps. A binary-binary CRBM was then trained on the real-valued probabilities of the hidden layer while driven by the training data. Each CRBM had 600 hidden units. All parameters used a learning rate of 1 × 10⁻³, except for the autoregressive weights (1 × 10⁻⁵). A momentum term was also used: 0.9 of the previous accumulated gradient was added to the current gradient. We used CD(10), i.e. 10 steps of alternating Gibbs sampling per iteration of contrastive divergence. Both layers were conditioned on the 12 previous frames. Generation of each frame involved alternating Gibbs sampling in the top two layers, followed by a single downward pass in the first CRBM. The previous frame, plus a small amount of Gaussian noise, was used for initialization.

If we are to carry out experiments that elicit responses from real pigeons using a virtual pigeon driven by our model, we must ensure that our synthesized data captures the subtleties characteristic of pigeon motion. In addition to fixed foot plants, pigeons demonstrate complex head-bobbing, defined by a distinct thrust phase and a hold phase during which the head actually remains stationary [4]. We were able to generate pigeon motion that closely resembled the true motion capture data (Fig. 2). The collection of head coordinate frames shows the distinct hold phase present in the generated motion. For videos of generated motion see: http://www.matthewzeiler.com/videos/.
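The per-frame generation loop described above can be sketched briefly. The model1/model2 objects and their sample_hidden, sample_visible, and mean_visible methods are hypothetical stand-ins for the conditional distributions defined by Eq. 1; this illustrates the procedure, it is not the authors' implementation.

```python
import numpy as np

def generate_frame(model1, model2, history1, history2, n_gibbs=30, noise=0.01):
    """Generate one frame from a two-layer CRBM.

    history1: array of the last N visible frames (most recent last)
    history2: array of the last N first-layer hidden vectors
    """
    rng = np.random.default_rng()
    # Initialize from the previous frame plus a small amount of Gaussian noise.
    v = history1[-1] + noise * rng.standard_normal(history1[-1].shape)
    h1 = model1.sample_hidden(v, history1)
    # Alternating Gibbs sampling in the top two layers (h1 <-> h2).
    for _ in range(n_gibbs):
        h2 = model2.sample_hidden(h1, history2)
        h1 = model2.sample_visible(h2, history2)
    # A single downward pass through the first CRBM yields the new frame.
    return model1.mean_visible(h1, history1)
```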

Although the hold phase is visually present, we have sought a quantitative comparison to the real motion capture data. The hold phase can be quantified by measuring rotation about the head frame's yaw axis and the path length of the head in the ground-plane with respect to the global coordinate system. A visual comparison of training data against data generated from our 2-layer CRBM model is shown in Fig. 3, along with a baseline model: a 12th-order autoregressive (AR) model fit by regularized least squares.

Fig. 3: Hold phases present in path length and yaw angles (shaded grey). Panels show head path length (mm) and yaw rotation (degrees) over time for the training data, the 2-layer CRBM generation, and the AR(12) generation.

Each hold phase was detected by determining where the empirical second derivative of the respective time series was either a minimum or a maximum, which corresponds to the leveling-off of the hold-phase regions. The standard deviation was calculated within each such region; the means over all hold-phase regions, along with the corresponding calculations for the thrust phases, are shown in Table 1.

Model / Phase                   Path Length (mm)   Yaw Angle (degrees)
Training data, hold phase       0.58               0.96
Training data, thrust phase     30.87              23.9
2-layer CRBM, hold phase        3.58               2.71
2-layer CRBM, thrust phase      31.18              19.69
AR(12), hold phase              11.54              6.68
AR(12), thrust phase            21.18              13.65

Table 1: Comparison of mean standard deviation in hold and thrust phases.
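A rough sketch of this measurement, under our own simplifying assumptions: instead of locating region boundaries exactly at extrema of the empirical second derivative, the function below flags flat runs where the first difference is small (flat_thresh is a hypothetical cutoff) and reports the mean within-region standard deviation, as in Table 1.

```python
import numpy as np

def hold_phase_stats(series, flat_thresh=0.5):
    """Split a head time series (ground-plane path length or yaw angle)
    into hold and thrust phases and report mean within-phase std. dev."""
    # Hold phases are runs where the signal is nearly level.
    is_hold = np.abs(np.gradient(series)) < flat_thresh

    stds = {"hold": [], "thrust": []}
    start = 0
    for i in range(1, len(series) + 1):
        # Close a run when the phase label changes (or at the end).
        if i == len(series) or is_hold[i] != is_hold[start]:
            if i - start > 2:  # ignore degenerate runs
                key = "hold" if is_hold[start] else "thrust"
                stds[key].append(np.std(series[start:i]))
            start = i
    return {k: float(np.mean(v)) for k, v in stds.items()}
```

Applied separately to the path-length and yaw-angle series of the training data and each model's generations, this yields numbers comparable to Table 1.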

Data generated from the 2-layer CRBM clearly captures the two distinct phases, as seen in Fig. 3 and in the differences between standard deviations. Since our model is stochastic, the std. dev. in the hold phase is greater than that of the mocap data: we are modeling ground-plane velocities instead of positions, so noise introduced by sampling is integrated and accumulates during post-processing. The reason why our model exhibits a smaller standard deviation in the hold phase for rotation than for translation may be an artifact of normalizing the data dimensions before training. While translational dimensions are expressed in mm before scaling (typically large values), rotational dimensions are expressed as exponential maps (small values), so noise in the translational dimensions is exaggerated, relative to the rotations, when we rescale. The precision of the Gaussian units, 1/σ_i, has been fixed to 1 in all of our experiments, since this tends to work well in practice [3]. Either learning this parameter or using a larger fixed value may improve these results.

4.2 Prediction of missing foot markers

Due to the relative size of the torso and feet, as well as the swelling of feathers, markers on the feet of pigeons are frequently occluded. Since our model would benefit from the availability of more data, we would like to exploit all possible mocap trials. One option would be to ignore, or marginalize out, any missing feet while training. Another is to use a CRBM trained on complete data to predict the missing feet. We carried out a series of experiments in which a 1-layer and a 2-layer CRBM are used to predict the 6 dof corresponding to the right foot, given the other 18 dof. We compare to several baseline methods: a nearest-neighbor search through the training set (based on the known data dimensions), an AR(12) model, and 1- and 2-layer static RBM models that do not use temporal information. 1-step prediction results are shown in Table 2.

Model              Mean Error (mm)
Nearest Neighbor   63.73
AR(12)             2.68
1-L RBM            26.43 ± 0.1
2-L RBM            31.31 ± 0.7
1-L CRBM (FE)      5.61
2-L CRBM (Gibbs)   8.27 ± 0.4

Table 2: Average 1-step prediction error for 7767 test frames. The std. dev. over 25 trials is also reported for the stochastic models.

Fig. 5: Comparison of various models for N-step prediction: mean error (mm) versus step size N for the AR(12), 1-L CRBM (FE), and 2-L CRBM (Gibbs) models.

For the 1-layer CRBM, we used a unique method to fill in missing markers. A method based on Gibbs sampling could be used for prediction, by initializing the unknown markers with their values at the previous time step and reconstructing only these markers on alternating Gibbs sampling steps, as described in [3]. The downside of this approach is that it is subject to noise. A better method may be to exploit the fact that the hidden variables are binary, and integrate them out to arrive at the free energy, given the model parameters and past observations:

F(v \mid \{v\}_{t-N}^{t-1}, \theta) = \sum_i \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_j \log\left(1 + \exp\left(\sum_i \frac{v_i}{\sigma_i} w_{ij} + b_j\right)\right).   (2)

F is the negative log probability of an observation, plus log Z (see Eq. 1). A setting of v that minimizes the free energy is found by fixing the known v_i and following the gradient of F with respect to the unknown v_i. In our tests we use a conjugate-gradient method initialized with the values at the previous frame.
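As a concrete illustration of this infilling scheme, here is a minimal sketch that minimizes the free energy of Eq. 2 over the unknown dimensions, with σ_i = 1 and SciPy's conjugate-gradient optimizer standing in for whatever CG routine was actually used; the weights, dynamic biases, and missing-dimension mask are assumed given, and all names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def fill_missing(v_prev, known, mask, W, bv_dyn, bh_dyn):
    """Predict missing visible dimensions by minimizing free energy (Eq. 2).

    v_prev: previous frame, used to initialize the unknowns
    known:  current frame, valid wherever mask is False
    mask:   boolean array, True where a dimension is missing
    W:      (n_vis, n_hid) weights; bv_dyn, bh_dyn: dynamic biases
    """
    def free_energy_and_grad(u):
        v = np.where(mask, 0.0, known)             # fresh array; known dims fixed
        v[mask] = u                                # plug in the current guess
        x = v @ W + bh_dyn                         # total input to each hidden
        F = 0.5 * np.sum((v - bv_dyn) ** 2) - np.sum(np.logaddexp(0.0, x))
        dF_dv = (v - bv_dyn) - W @ (1.0 / (1.0 + np.exp(-x)))
        return F, dF_dv[mask]                      # gradient w.r.t. unknowns only

    res = minimize(free_energy_and_grad, v_prev[mask], jac=True, method="CG")
    out = known.copy()
    out[mask] = res.x
    return out
```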

One method for filling in missing data using the 2-layer CRBM is to generate as described earlier, but to replace only the missing dimensions during the final downward pass through the first CRBM. This has the disadvantage that sampling the hiddens does not take into account the known visible dimensions of the current frame. Prediction can be improved by linearly blending the top-down input (from CRBM 2) and the bottom-up input (from CRBM 1) to the first hidden layer. For the models using temporal information, we also carried out N-step prediction (Fig. 5), showing that multi-layer models can capture longer-term dependencies.

5 Conclusion

We have proposed a model of pigeon motion based on a Conditional Restricted Boltzmann Machine. A model trained on mocap data from a single pigeon can synthesize realistic data; this is verified by videos and through quantitative analysis of head-bobbing. The model can also be used to fill in missing foot data. Our results show that minimizing free energy with respect to the missing variables gives superior results to the Gibbs sampling method proposed in [3] for short-term predictions, while the benefits of adding a second layer are evident in longer-term prediction. Future work will be to drive a virtual pigeon, rendered with computer graphics, using data generated by our model, in an attempt to elicit social behaviour in real pigeons. We also hope to train CRBMs on data captured from pairs of pigeons; this joint model could be used to predict the behaviour of one bird given the motion of another.

References

[1] B.J. Frost, N.F. Troje, and S. David. Pigeon courtship behaviour in response to live birds and video presentations. In 5th International Congress of Neuroethology, 1998.
[2] S. Watanabe and N.F. Troje. Towards a virtual pigeon: A new technique to investigate avian social perception. Animal Cognition, 9:271-279, 2006.
[3] G.W. Taylor, G.E. Hinton, and S.T. Roweis. Modeling human motion using binary latent variables. In NIPS 19, pages 1345-1352, Cambridge, MA, 2007. MIT Press.
[4] N.F. Troje and B.J. Frost. Head-bobbing in pigeons: How stable is the hold phase? The Journal of Experimental Biology, 203(4):935-940, 2000.
[5] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, August 2002.
[6] G.E. Hinton, S. Osindero, and Y.W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.