Learning Pigeon Behaviour Using Binary Latent Variables Matthew D. Zeiler


By Matthew D. Zeiler
Supervisor: Geoffrey E. Hinton
April 29

Abstract

In an effort to better understand the complex courtship behaviour of pigeons, we have built a model learned from motion capture data. We employ a Conditional Restricted Boltzmann Machine with binary latent features and real-valued visible units. The units are conditioned on information from previous time steps in a sequence to learn long-term effects and to infer current features. We validate a trained model by quantifying the characteristic head-bobbing present in generated pigeon motion. We also introduce a method of predicting missing data by marginalizing out the hidden variables and minimizing the free energy of the model. An alternative prediction method using forward and reverse passes over gaps of missing markers is presented as well. Lastly, the effects of head and foot motion on prediction results are analyzed.

Acknowledgements

I would like to thank Professor Geoffrey E. Hinton of the Department of Computer Science at the University of Toronto for supervising my research on this project and providing resources for the experimentation. His research in the field of machine learning and neural networks is second to none, and without it the research within this thesis would not have been possible. Further, I thank Graham W. Taylor, a PhD candidate in this department under Geoffrey E. Hinton and Sam T. Roweis, for the exceptional patience, guidance, and advice that he provided throughout this thesis work. His expertise with Conditional Restricted Boltzmann Machines and their application to motion capture and animation was the foundation of my research, and being able to work with him to further the applications of these models has been very rewarding. Lastly, I would like to thank Nikolaus F. Troje of the Psychology Department at Queen's University for providing the pigeon motion capture data and the previous research that inspired this project.

Contents

List of Symbols
List of Figures
List of Tables
1 Introduction
  1.1 Overview
  1.2 Literature Review
    1.2.1 Understanding Pigeon Behaviour
    1.2.2 Alternative Models
  1.3 The Conditional Restricted Boltzmann Machine
2 Data Gathering and Preprocessing
  2.1 Motion Capture Setup
  2.2 Body Centered Coordinates
  2.3 Ground Plane Invariance
  2.4 Data Normalization and Batch Preparation
3 Motion Generation
  3.1 Generation of Novel Motion
  3.2 Quantifying Head Motion
4 Prediction of Missing Foot Markers
  4.1 Marker Occlusion
  4.2 Method 1: Alternating Gibbs Sampling
  4.3 Method 2: Free-Energy Minimization
  4.4 N-step Prediction Results
  4.5 Bi-Directional Gap Filling
  4.6 Motion Effects on Frame Prediction
  4.7 Trained Model Substitution
5 Conclusion
6 Recommendations for Future Work
References
A Coordinate Transformations

List of Symbols

AR     Auto-Regressive
CD     Contrastive Divergence
CRBM   Conditional Restricted Boltzmann Machine
dof    degrees of freedom
fps    frames per second
HMM    Hidden Markov Model
LDS    Linear Dynamical System
MoCap  Motion Capture
RBM    Restricted Boltzmann Machine
TRBM   Temporal Restricted Boltzmann Machine

List of Figures

1  Pigeon video test setup at Queen's University
2  Boltzmann Machine with 3 visible units and 2 hidden units
3  Restricted Boltzmann Machine with 3 visible units and 2 hidden units
4  Architecture of a 3rd-order CRBM
5  The CRBM used as a building-block in deep architectures
6  Motion capture setup at Queen's University Biomotion Lab
7  Vicon MX-2 series 1.3 Megapixel motion capture camera
8  Example motion capture marker placement and derived coordinate frames
9  Stroboscopic image from a walking pigeon
10 Example hold-phases present in generated motion
11 Hold-phases present in path length and yaw angles (shaded grey)
12 2-layer CRBM showing the prediction problem
13 Effects of changing the ratio of visible influence (λ)
14 Comparison of various models for N-step prediction
15 Comparison of 2-layer CRBM predictions for right foot, left foot, and both feet
16 The effect of changing the forward model weighting (A_F)
17 Effects of motion on frame prediction errors for various prediction techniques
18 Forward/reverse 1-layer CRBMs used for bi-directional prediction
19 Queen's University Biomotion Lab virtual pigeon

List of Tables

1 Notation for Coordinate Frames
2 Comparison of Mean Standard Deviation in Hold and Thrust Phases
3 Average 1-step prediction errors for 773 test frames
4 Mean prediction errors for various methods over 773 test frames

1 Introduction

1.1 Overview

Investigating avian social perception is an interesting problem, as it presents a sandbox in which we can experiment to better understand interactive social behaviour. In this thesis, time-series neural network models are used to capture key characteristics of avian behaviour. To test the modeling capacity of these networks, various experiments are conducted which evaluate the potential of the model. The hope is to provide researchers both control over and insight into the complex factors underlying courtship behaviour. The models used are first trained on motion capture (MoCap) data collected from various pigeon specimens. Once a model of the data is learned, it can be used to generate motion, to predict positions of body segments, or to provide useful animation for determining the social interactions between real pigeons. Due to the complex data set and long sequences, a powerful model is required to capture all relevant subtleties of pigeon motion in order to truly provide suitable interactions with a real pigeon. To complete this task, a time-series extension of a Boltzmann Machine was utilized. Before the experiments can be explained, however, some background on previous work must be considered.

1.2 Literature Review

1.2.1 Understanding Pigeon Behaviour

Recent studies investigating the complex motion of the pigeon, Columba livia, demonstrate distinct patterns present throughout walking sequences [1]. The pigeon walk can be characterized by its distinct head-bobbing, which consists of a hold-phase in which the head displays no translation or rotation. Between such hold-phases are periods of rapid changes in position and orientation of the head, called the thrust-phases of the motion. By quantifying these two phases and comparing them between pigeons in courtship or in regular walking motion, differences can be noted [2].
These differences help researchers classify pigeon behaviour, and thus measuring these quantities in the output from our models could provide insight into how well our models learn the subtleties of pigeon motion. Further, studies investigating the complex courtship behaviour of pigeons demonstrate that courtship responses can occur not only with real partners, but also with video [3]. Confining a pigeon within a small box in a dark room, as shown in Figure 1, and displaying video of another pigeon exhibiting courtship behaviour immediately elicits a response in the live pigeon. More recently, social behaviour in pigeons has been elicited by a virtual pigeon, driven by motion capture (MoCap) data gathered from a real pigeon and rendered through a computer graphics engine [4]. Being able to generate novel motion onto which the computer graphics engine can be applied would allow researchers to test more precisely what criteria in the motion garner a specific response. Thus, this thesis also aims to generate realistic pigeon motion with the future possibility

Figure 1: Pigeon video test setup at Queen's University.

of displaying it to a live pigeon to determine if the model is powerful enough to create a response.

1.2.2 Alternative Models

Motion capture data requires a powerful learning algorithm to capture all its relationships, as it is very nonlinear and has multiple inherent characteristics that are related to one another (componential structure). A standard model for sequential data used by researchers is the Hidden Markov Model (HMM), which, for example, has been used by Bregler to recognize motion in video sequences [5]. The HMM utilizes a combination of hidden states and observable outputs with respective transition and output probabilities. Training these models reduces to the problem of determining the transition probabilities between hidden states and the output probabilities, given a hidden state, that produce an observable sequence (see [6] for more details). Due to their structure, Hidden Markov Models are unable to efficiently model the complex motion capture data used in these experiments. Since HMMs use a K-state multinomial to model time-series data, increasing their modeling capacity requires an exponential growth in the number of hidden states: to model N bits of past information, the number of hidden states would need to grow as 2^N. Due to this explosive growth in the number of HMM parameters required to capture features of the data, it becomes intractable to use this type of model efficiently. A distributed model capable of representing the varied information efficiently over a set of states is therefore desirable. One such distributed model is the Linear Dynamical System (LDS). While these models are distributed, they are unable to deal with highly non-linear data such as the motion capture data used here. For static data sets, the Boltzmann Machine provides

Figure 2: Boltzmann Machine with 3 visible units and 2 hidden units.

both a distributed representation of the data and non-linear modeling capabilities. Though a Boltzmann Machine can consist of only a collection of visible units, comparable in size to given data vectors, these models can also have multiple layers of hidden units. These latent variables, as they are called, can aid in modeling abstract features of the data. Shown in Figure 2 is an example Boltzmann Machine composed of a visible layer of units corresponding in size to the data and a hidden layer of units which can be of an arbitrary size. The total input z_i to each unit i from all other units, each denoted as j, is given as:

z_i = b_i + Σ_j s_j w_ij    (1)

where w_ij is the weight between units i and j. The b_i term in Equation 1 is a bias input on unit i, which is equivalent to a connection from another unit whose output is always 1. With the total input z_i passed through a sigmoid function, unit i turns on (taking the value 1) with probability:

p(s_i = 1) = 1 / (1 + e^(−z_i))    (2)

Using Equation 2, the units of the network can be updated one at a time. Doing so for a long period of time will allow the network to reach a Boltzmann distribution, where the probability of a state vector v is determined by its energy E(v):

P(v) = e^(−E(v)) / Σ_u e^(−E(u))    (3)

E(v) = −Σ_i s_i^v b_i − Σ_{i<j} s_i^v s_j^v w_ij    (4)

The update rule for this model has a convenient, simple form. The update to a weight w_ij is proportional to the difference between the expected value of the pairwise states when the visible units are clamped to the data, <s_i s_j>_data, and the model's equilibrium expectation, <s_i s_j>_model:

Δw_ij ∝ <s_i s_j>_data − <s_i s_j>_model    (5)

where <s_i s_j>_model is computed by alternating Gibbs sampling between the hidden and visible units until equilibrium is reached. Notice that the updates are local in the sense that they only depend on the values of s_i and s_j.
A similar update rule applies to the biases b_j:

Δb_j ∝ <s_j>_data − <s_j>_model    (6)

All units in a Boltzmann Machine are fully connected, which is a computational limitation of this model when it comes to learning, since reaching an equilibrium distribution becomes intractable [7]. To make the learning procedure significantly faster, two things can be done: 1) restrict the connections between units in the model, and 2) modify the learning procedure. If the connections are restricted to only occur between visible and hidden units (removing the hidden-hidden and visible-visible connections, as shown in Figure 3), the hidden units become conditionally independent given a visible vector. This type of model is known as a Restricted Boltzmann Machine (RBM). Restricting the connections allows the units to be updated in parallel to provide unbiased samples of <s_i s_j>_data in one step. Additionally, once the connections are restricted, a sample of <s_i s_j>_model can be obtained by alternating parallel updates of the visible and hidden units.

Figure 3: Restricted Boltzmann Machine with 3 visible units and 2 hidden units.

To reach a true equilibrium distribution, these updates would have to repeat for an infinitely long time. However, for practical purposes, only a few iterations of these parallel alternating updates are required to get <s_i s_j>_reconstruction, which is a good approximation to <s_i s_j>_model. This procedure can be summarized as follows:

1. Clamp a data vector to the visible units and update the hidden units in parallel.
2. Using the activations of the hidden units, update the visible units in parallel to get a reconstruction of the data.
3. Update all hidden units using the reconstructed visible units and repeat steps 2-3 for the desired number of iterations.

This procedure is an approximate gradient descent in a quantity known as contrastive divergence (CD) [7]. Another result of the connections being restricted is the ability of RBM models to be stacked, which has been shown to improve a variational bound on the probability of the data [8]. In this way, layers of hidden units can be used to model additional abstract features of the data at multiple levels.
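The three-step procedure above amounts to a single CD-1 update for a binary RBM, which can be sketched in a few lines of NumPy. The layer sizes, learning rate, and random batch below are illustrative assumptions, not values from this thesis:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b_v, b_h, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM.

    v0: batch of binary data vectors, shape (batch, n_vis).  Clamp the
    data, sample the hiddens, reconstruct the visibles, recompute the
    hiddens, then update with <s_i s_j>_data - <s_i s_j>_reconstruction.
    """
    ph0 = sigmoid(v0 @ W + b_h)                   # 1. up-pass on the data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b_v)                 # 2. down-pass: reconstruction
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b_h)                   # 3. up-pass on reconstruction
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n       # pairwise statistics
    b_v += lr * (v0 - v1).mean(axis=0)            # visible bias update
    b_h += lr * (ph0 - ph1).mean(axis=0)          # hidden bias update
    return W, b_v, b_h

n_vis, n_hid = 3, 2                               # matches Figure 3's toy RBM
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)
data = rng.integers(0, 2, size=(10, n_vis)).astype(float)
W, b_v, b_h = cd1_step(data, W, b_v, b_h)
```

In practice the positive statistics use the hidden probabilities rather than samples, which reduces the variance of the update without changing its expectation.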
When using multiple levels, a further increase in training speed results from training each layer separately. Training one layer of hidden units using data vectors applied to the visible units is simple. Using the activations of hidden units in the trained layer, while a given data vector is clamped to the visible units of this layer, gives a training vector for the layer above. A complete set of these activation vectors can be formed using all the data vectors from the training set, individually clamped at the visible units. This process of using activations of hidden units to train the layers above can be repeated for any number of layers in the model. Once trained, the RBM model can be used to both discriminate and generate by clamping specific units [9]. By adding labels in higher hidden layers and smoothing the gradients with backpropagation once initialized with the results of the training procedure outlined above, a clamped

visible vector can be classified by the labels through sampling. Alternatively, clamping a label can provide adequate information for the model to generate visible vectors that represent data related to that label. This generative ability is what makes RBMs particularly interesting for determining the amount and generality of information learned by the model, as we wish to do throughout this thesis. While the Restricted Boltzmann Machine is capable of efficiently learning non-linear distributed representations of the data and generating visible vectors based on the learned parameters, it lacks the temporal information which would make it suitable for modeling time-series data such as motion capture. Since much knowledge of pigeon behaviour can be learned from motion capture data of pigeons participating in courtship routines, a model that captures temporal dependencies is desired. One capable model, used throughout this research, is discussed in the following section.

1.3 The Conditional Restricted Boltzmann Machine

The success of a temporal extension of Restricted Boltzmann Machines in modeling human motion [1] has prompted us to consider this powerful class of models for learning on MoCap data captured from both single pigeons and pairs of pigeons in courtship. Unlike other neural network models that have been used to learn the physical constraints of a system [11], the power of this model can be used to learn the dynamics as opposed to defining them analytically [1]. This has previously been done by training a Conditional Restricted Boltzmann Machine (CRBM) on motion capture data of human walking, running, and other motions [1]. A similar approach is applied here to model pigeon behaviour. The CRBM is a non-linear generative model for time-series data and is a special case of the Temporal Restricted Boltzmann Machine (TRBM) [8] in which there are no temporal connections between hidden units.
This makes filtering in the CRBM exact, and mini-batch learning possible, as training does not have to be done sequentially. This latter property can greatly speed up learning as well as smooth the learning signal, as the order of data vectors presented to the network can be randomized. The CRBM model is based on the RBM model at the current time step, but has directed connections from visible units at previous time steps to both the hidden and visible units of the current time step, as shown in Figure 4. Various connections are indicated in the figure, including: directed visible-hidden weights (red), directed visible-visible weights (green), undirected visible-hidden weights (blue), and static biases (purple).

Figure 4: Architecture of a 3rd-order CRBM.

The directed connections to the hidden and visible units act as dynamic biases input into each set of units. Unlike the static biases present in the model, which have a constant input of 1, the dynamic biases carry the value of the previous visible unit to which they are attached. The undirected connections at the current time step connect the binary latent variables, h, and the visible variables, v. Typically for RBMs and CRBMs, the latent variables have stochastic binary states instantiated by sampling the inputs to each hidden unit. Alternatively, a mean-field setting could be used, in which case the result of Equation 2 becomes the output of the unit. In this thesis, both approaches are used in select instances to improve results. Instead of binary units for the visible variables, these models can use any function that is part of the exponential family [12]. These exponential functions have a linear effect when considering log probabilities throughout the model. In modeling motion capture for this thesis, Gaussian units were chosen, which evaluate to the total input plus additive Gaussian noise at the output. At each time step, v and h receive directed connections from the visible variables at a certain number of previous time steps; this number defines N, the order of the model. The CRBM model defines a joint probability distribution over v and h, conditional on the past N observations {v}_{t−N}^{t−1} and model parameters θ:

p(v, h | {v}_{t−N}^{t−1}, θ) = exp(−E(v, h | {v}_{t−N}^{t−1}, θ)) / Z

E(v, h | {v}_{t−N}^{t−1}, θ) = Σ_i (v_i − b_i)² / (2σ_i²) − Σ_j h_j b_j − Σ_{ij} w_ij (v_i / σ_i) h_j    (7)

where Z is a constant called the partition function, which is exponentially expensive to compute exactly. The dynamic biases, b_i and b_j, are affine functions of the past N observations. Such an architecture makes on-line inference efficient and allows us to train by minimizing contrastive divergence [7]. To train a CRBM model, the weight updates for the symmetric connections and the static biases remain the same as in the RBM training procedure (Equations 5 and 6).
The directed connections from the previous time steps, t−1, t−2, ..., t−N, to the current hidden units have a slightly different weight update rule:

Δd_ij^(t−q) ∝ <v_i^(t−q) h_j^t>_data − <v_i^(t−q) h_j^t>_reconstruction    (8)

where d_ij^(t−q) is the weight from visible unit i at time t−q to hidden unit j, for q = 1...N, where N is the order of the model. Additionally, there is a new weight update rule for the visible-unit dynamic bias weights, often referred to as the auto-regressive weights:

Δa_ki^(t−q) ∝ <v_k^(t−q) v_i^t>_data − <v_k^(t−q) v_i^t>_reconstruction    (9)

where a_ki^(t−q) connects visible unit k at time t−q to visible unit i at the current time t, for q = 1...N. Each weight update is scaled by a small learning rate to prevent the training from overshooting local (or perhaps global) optima. Typically the auto-regressive weights have a smaller learning rate than the rest of the connections, because the correlations between the visible units at previous time steps and those at the current time step are much stronger than other pairwise correlations. Just as with the static RBM model, an important feature of the CRBM is that once it is trained,
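The directed-weight updates in Equations 8 and 9 are just outer products of past frames with current-frame statistics, accumulated per time lag. A minimal sketch, where the array names, shapes, and learning rate are illustrative assumptions:

```python
import numpy as np

def directed_weight_updates(past, v0, h0, v1, h1, lr=1e-3):
    """CD-style updates for the directed CRBM connections (Eqs. 8 and 9).

    past: (N, n_vis) previous frames, lag q = 1..N along axis 0.
    v0, h0: data-phase visibles/hidden probabilities at time t.
    v1, h1: reconstruction-phase visibles/hidden probabilities at time t.
    Returns updates for the visible-to-hidden weights d and the
    auto-regressive weights a, one slice per time lag.
    """
    # <v_i^(t-q) h_j^t>_data - <v_i^(t-q) h_j^t>_reconstruction
    delta_d = lr * (np.einsum('qi,j->qij', past, h0)
                    - np.einsum('qi,j->qij', past, h1))
    # <v_k^(t-q) v_i^t>_data - <v_k^(t-q) v_i^t>_reconstruction
    delta_a = lr * (np.einsum('qk,i->qki', past, v0)
                    - np.einsum('qk,i->qki', past, v1))
    return delta_d, delta_a

N, n_vis, n_hid = 3, 24, 20              # 24 dof per frame as in this thesis
rng = np.random.default_rng(2)
past = rng.standard_normal((N, n_vis))
dd, da = directed_weight_updates(past,
                                 rng.standard_normal(n_vis), rng.random(n_hid),
                                 rng.standard_normal(n_vis), rng.random(n_hid))
```

Following the text, `delta_a` would be applied with a smaller learning rate than the symmetric and directed visible-hidden weights.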

we can add layers, as in a Deep Belief Network [9]. The previous layer's CRBM is kept, and the sequence of hidden state vectors, while driven by the data, is treated as a new kind of fully observed data. The next-level CRBM has the same architecture as the first, but now has binary visible units to match the hidden units in the previous layer, while the number of hidden units in this next layer can be set arbitrarily (see Figure 5). This next layer is trained in the same way as the previous one.

Figure 5: The CRBM used as a building-block in deep architectures.

This greedy procedure is justified using a variational bound [9]. Following greedy learning, both the weights and the simple inference procedure are suboptimal in all but the top layer of the network, as the weights have not changed in the lower layers since their respective stage of greedy training. However, a contrastive form of the wake-sleep algorithm [13], called the up-down algorithm [9], can be used to fine-tune the generative model. Fine-tuning has been observed to improve the quality of generated sequences. Once trained, the CRBM model can generate motion when supplied a few initialization frames of MoCap, by alternating updates of the units in the model. More layers can also aid in capturing multiple styles of motion and in permitting transitions between these styles, as was shown by Taylor et al. [1] when training a single CRBM on MoCap data of different human walking styles. The applicability of these models for representing pigeon motion lies in their ability to generate time-series data. As with RBMs, the CRBM could potentially be used to classify labeled data if the layout were modified slightly to include labeled units. However, it is the generation of novel motion from a trained model that we can use to analyze pigeon behaviour. In this thesis, one- and two-layer models are tested in their ability to act as generative models of pigeon motion.
They are compared against other approaches in their ability to capture the subtleties of the motion, notably the characteristic head-bobbing [1], and in their capability of predicting the location of feet, which are frequently occluded during motion capture.
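The generation procedure described above (seed with N frames, then alternate Gibbs updates per new frame) can be sketched as follows. The mean-field Gaussian visibles with unit variance, the random weights, and all shapes are illustrative assumptions, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def generate(n_new, seed, A, D, W, b_v, b_h, gibbs_iters=30):
    """Generate frames from an order-N CRBM (a sketch).

    seed: (N, n_vis) initialization frames, oldest first.  Each new frame
    is produced by alternating updates between sampled binary hiddens and
    mean-field Gaussian visibles (sigma = 1), conditioned on the previous
    N frames through the dynamic biases.
    """
    N, n_vis = seed.shape
    frames = [f.copy() for f in seed]
    for _ in range(n_new):
        past = np.stack(frames[-N:][::-1])            # most recent first
        bv_t = b_v + np.einsum('qi,qij->j', past, A)  # dynamic visible bias
        bh_t = b_h + np.einsum('qi,qij->j', past, D)  # dynamic hidden bias
        v = frames[-1].copy()                         # start from last frame
        for _ in range(gibbs_iters):
            ph = sigmoid(v @ W + bh_t)
            h = (rng.random(ph.shape) < ph).astype(float)
            v = bv_t + h @ W.T                        # mean-field visibles
        frames.append(v)
    return np.stack(frames[N:])

N, n_vis, n_hid = 3, 24, 20
A = 0.01 * rng.standard_normal((N, n_vis, n_vis))     # autoregressive weights
D = 0.01 * rng.standard_normal((N, n_vis, n_hid))     # directed vis-to-hid
W = 0.01 * rng.standard_normal((n_vis, n_hid))        # undirected weights
motion = generate(5, rng.standard_normal((N, n_vis)),
                  A, D, W, np.zeros(n_vis), np.zeros(n_hid))
```

With real trained parameters, the generated frames would then be de-normalized and converted back to global marker positions for playback.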

2 Data Gathering and Preprocessing

2.1 Motion Capture Setup

The data set used for training and testing was motion capture provided by Nikolaus Troje from the Biomotion Lab at Queen's University. The data was captured using markers at various locations on four segments of the pigeon: the head, torso, and both feet. To ensure the marker positions on each segment of the pigeon did not vary with respect to each other, cardboard was placed on the head and back to hold those markers in place. Each pigeon was then recorded with an array of 12 synchronized cameras while allowed to walk in an enclosed area. The setup is shown below in Figure 6, along with one of the 12 Vicon MX Megapixel cameras used to capture the data (Figure 7). Available data sets contained various standing, walking, and courtship sequences of either one or two pigeons. All experiments were done using the data from a single pigeon. This data was either derived from motion capture data of a courtship sequence between two pigeons or from that of a single pigeon reacting to another. The collected data was cleaned to account for sensor noise and occlusion, providing (x, y, z) positions of each marker in mm with respect to a global coordinate system.

Figure 6: Motion capture setup at Queen's University Biomotion Lab.
Figure 7: Vicon MX-2 series 1.3 Megapixel motion capture camera.

2.2 Body Centered Coordinates

As a first attempt to preprocess the data into a form suitable for learning, the pigeon segments were converted into a hierarchy of coordinate frames. From the (x, y, z) positions of each marker output from the camera system, it is desired to model the body, head, and each foot as separate coordinate frames, each with six degrees of freedom (dof).

Figure 8: Example motion capture marker placement and derived coordinate frames.

An example of this conversion is shown in Figure 8. In this figure, black circles represent motion capture markers used to define each origin (black dots), and the red circles are the remaining markers. Coordinate frame origins are connected with black lines, and the unit vectors for the x, y, and z directions are represented by red, green, and blue lines respectively. To begin this process of conversion for the head and body, three markers were selected on each segment: two near the back on the left and right sides of the pigeon, and one near the front (beak or chest, depending on the segment). A vector was assigned from the right marker p_R towards the front marker p_F to give the vector v_RF. A second vector for each segment was formed from the right marker p_R to the left marker p_L to get v_RL. In order to place the origin along v_RL such that the vector connecting it to the front marker intersects v_RL perpendicularly, a projection onto v_RL was used:

p_origin = p_R + ((v_RL · v_RF) / (v_RL · v_RL)) v_RL    (10)

Also, the normalized cross product between the two vectors v_RF and v_RL gave a vector pointing up from the plane of the three chosen points:

ẑ = (v_RF × v_RL) / |v_RF × v_RL|    (11)

This ẑ vector for each of these two segments formed the first basis vector of their respective coordinate systems.
The ˆx (front-facing) vector was found using the difference of the origin and front locations of each segment:

ˆx = (p_F − p_origin) / |p_F − p_origin|    (12)

The final unit vector to be calculated for each coordinate system was ŷ, which represents a left-facing vector found using the left marker location p_L and the origin p_origin of the head and body segments respectively:

ŷ = (p_L − p_origin) / |p_L − p_origin|    (13)

Using the location of each origin, along with the three unit vectors of each coordinate system, both translations and rotations were obtained as described shortly. The feet required a slightly different
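The projection and cross-product construction above can be sketched directly in NumPy; the marker positions in the example are made up for illustration:

```python
import numpy as np

def segment_frame(p_R, p_L, p_F):
    """Coordinate frame of a head/body segment from its three markers.

    The origin is the foot of the perpendicular from the front marker
    onto the right-to-left line; z is the normalized cross product of
    v_RF and v_RL (up from the marker plane); x points front, y left.
    """
    v_RL = p_L - p_R
    v_RF = p_F - p_R
    origin = p_R + (v_RL @ v_RF) / (v_RL @ v_RL) * v_RL
    z = np.cross(v_RF, v_RL); z /= np.linalg.norm(z)
    x = p_F - origin;         x /= np.linalg.norm(x)
    y = p_L - origin;         y /= np.linalg.norm(y)
    return origin, x, y, z

# Hypothetical right, left, and front (beak) marker positions in mm.
o, x, y, z = segment_frame(np.array([1.0, -1.0, 0.0]),
                           np.array([1.0,  1.0, 0.0]),
                           np.array([3.0,  0.0, 0.0]))
```

Because the origin is the perpendicular foot of the front marker on the right-to-left line, x and y come out orthogonal by construction, and z is orthogonal to both.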

procedure to derive their coordinate systems, since there were only two markers placed on each foot. The average of these two markers, p_F and p_B, on each foot was taken to be the position of the origin p_origin for the respective foot:

p_origin = (p_F + p_B) / 2    (14)

The front-facing vector ˆx was again the direction from the origin p_origin to the front marker p_F:

ˆx = (p_F − p_origin) / |p_F − p_origin|    (15)

The next step was slightly different for the feet, since an up-vector had to be chosen arbitrarily to determine the coordinate frame. Since the pigeon was always assumed to be walking upright, the up-vector was chosen to be:

v_up = [0 0 −1]^T    (16)

Using this vector put a constraint on the system, because there could no longer be any rotations about the ˆx axis of either foot. This reduced the degrees of freedom of each foot to 5, with 3 for translations and 2 for rotations. Despite this reduction in degrees of freedom, the representation used for each foot still stored 6 variables for all possible degrees of freedom in the model. Using the v_up vector, the left-facing direction was determined using the following cross product:

ŷ = (v_up × ˆx) / |v_up × ˆx|    (17)

Finally, using both ˆx and ŷ, the vector ẑ was found:

ẑ = (ˆx × ŷ) / |ˆx × ŷ|    (18)

Since the process of deriving the coordinate systems was the same for both the body and head segments, and identical for each foot, the calculations above were not shown with any subscripts indicating which segment each variable represents. Shown below as a post-subscript are the markings for the body (B), head (H), left foot (LF), and right foot (RF) segments.
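The two-marker foot construction can be sketched as below. The sign of the fixed up-vector is an assumption taken from the ground-plane vertical mentioned in Section 2.3, and the marker positions are made up:

```python
import numpy as np

def foot_frame(p_F, p_B):
    """Foot coordinate frame from its two markers (front and back).

    The origin is the marker midpoint; the fixed up-vector removes any
    rotation about the foot's x axis, leaving 5 effective dof.
    """
    origin = (p_F + p_B) / 2.0
    x = p_F - origin; x /= np.linalg.norm(x)   # front-facing direction
    v_up = np.array([0.0, 0.0, -1.0])          # assumed ground-plane vertical
    y = np.cross(v_up, x); y /= np.linalg.norm(y)
    z = np.cross(x, y);    z /= np.linalg.norm(z)
    return origin, x, y, z

# Hypothetical front and back foot markers in mm.
o, x, y, z = foot_frame(np.array([2.0, 0.0, 0.0]),
                        np.array([0.0, 0.0, 0.0]))
```

Note that the construction degenerates when the foot's front-facing direction is parallel to the up-vector; the walking-upright assumption in the text rules that out.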
Also, up to this point, all the calculations were in the global coordinate system, which shall be indicated with a pre-superscript G, giving the following vectors:

                     Body        Head        Left Foot     Right Foot
origin (O)           G p_O(B)    G p_O(H)    G p_O(LF)     G p_O(RF)
forward direction    G ˆx_(B)    G ˆx_(H)    G ˆx_(LF)     G ˆx_(RF)
left direction       G ŷ_(B)    G ŷ_(H)    G ŷ_(LF)     G ŷ_(RF)
up direction         G ẑ_(B)    G ẑ_(H)    G ẑ_(LF)     G ẑ_(RF)

Table 1: Notation for Coordinate Frames.

Once the coordinate frames for each segment were determined, it was possible to find the rotation matrices relating each frame. This was done using dot products between the two frames, as shown in Equation 41 of Appendix A. The pre-superscript of a rotation matrix indicates the coordinate frame that results from left-multiplying a vector by the rotation matrix. The post-subscript indicates what coordinate frame the original vector should be in. To convert to a body-centered representation, the G R_H, G R_B, G R_LF, and G R_RF rotation matrices were computed first, from which the others were derived. Since a rotation matrix is orthogonal, its inverse is its transpose, and thus the following compound rotation matrices were formed:

B R_G = (G R_B)^−1    (19)
H R_B = (G R_H)^−1 G R_B    (20)
LF R_B = (G R_LF)^−1 G R_B    (21)
RF R_B = (G R_RF)^−1 G R_B    (22)

The first of these rotation matrices represents the rotation of the body frame with respect to the global coordinate system. The remaining three represent the rotations of the head, left foot, and right foot with respect to the body coordinate frame respectively. All four of these matrices were converted to the exponential map representation [14], giving three values which define the direction of an axis about which the rotation occurs and whose combined magnitude indicates the size of the angle through which the segment rotates about this axis. These three values for each segment were therefore the three rotational dof of that segment. The translation of the body segment was simply its location in global coordinates, but the translations of the other three segments were calculated with respect to the body origin.
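The compound rotations and the exponential-map conversion can be sketched as below; the log-map formula used is the standard axis-angle extraction, not necessarily the exact algorithm of [14], and the example rotation is made up:

```python
import numpy as np

def exp_map(R):
    """Rotation matrix -> axis-angle vector: the direction is the rotation
    axis and the magnitude is the rotation angle (standard matrix log map;
    the angle = pi edge case is not handled in this sketch)."""
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(angle, 0.0):
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2.0 * np.sin(angle))
    return angle * axis

# Compound rotation as in the text, using the transpose as the inverse
# of an orthogonal rotation matrix: H_R_B = (G_R_H)^T @ G_R_B.
G_R_H = np.array([[0.0, -1.0, 0.0],     # example: 90 degrees about z
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
G_R_B = np.eye(3)
H_R_B = G_R_H.T @ G_R_B
rot_dof = exp_map(G_R_H)                # three rotational dof of the segment
```

For the example rotation, the three stored values come out as an axis along z scaled by the 90-degree angle.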
To do this, the origin of each segment had to be multiplied by the B T_G transformation matrix which, as defined in Appendix A, is:

B T_G = [ B R_G   B p_BG ]    (23)

where B p_BG is the position of the global origin expressed in body coordinates:

B p_BG = −(G R_B)^−1 (G p_GB) = −(G R_B)^−1 G p_O(B)    (24)

Then each origin in body-centered coordinates was given by:

B p_O(H) = B T_G G p_O(H)    (25)
B p_O(LF) = B T_G G p_O(LF)    (26)
B p_O(RF) = B T_G G p_O(RF)    (27)

The four origin positions (the three directly above and G p_O(B)) gave the twelve translational degrees of freedom of the hierarchical system. Once the data was converted into body-centered coordinates, it was then made invariant to rotations and translations in the ground plane.

2.3 Ground Plane Invariance

As in [1], to make the model as general as possible, it was required to standardize the starting position and direction of the pigeon in each data sequence. Since we were interested in learning pigeon behaviour and not specific locations of the pigeon, rotations about the ground-plane vertical and movement in the ground plane itself were treated as velocities. Thus, instead of modeling absolute values of position and rotation, the data was converted into differences of these quantities between frames. This still allowed the model to capture the walking behaviour and to generate new motion from any initial position and direction. The ground-plane vertical used was the negative z-axis of the global coordinate system, v_up. Rotating the x-axis and y-axis unit vectors of the global coordinate system into body coordinates using B R_G gave two vectors that could be compared to the vertical to get the tilt and bend angles of the torso:

θ_tilt = cos^−1( v_up · (B R_G [1 0 0]^T) / |B R_G [1 0 0]^T| )    (28)

θ_bend = cos^−1( v_up · (B R_G [0 1 0]^T) / |B R_G [0 1 0]^T| )    (29)

The above two angles represent the first two of the six degrees of freedom stored for the body segment. By projecting the forward-facing vector of the body segment into the ground plane and calculating its x and y offsets, the ground-plane rotation was computed as:

θ_ground = tan^−1( y_ground / x_ground )    (30)

Note that all rotations were phase-unwrapped so that angles were not restricted to [−π, π]. This was done because the pigeon could potentially rotate more than 2π radians throughout a sequence.
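The heading extraction and phase unwrapping can be sketched as follows; atan2 is used as the robust form of the arctangent in Equation 30, and the turning trajectory is synthetic:

```python
import numpy as np

def ground_heading(fwd):
    """Per-frame ground-plane heading from forward-facing vectors.

    fwd: (T, 3) forward vectors in global coordinates.  Projects into the
    ground plane, takes the angle of the (x, y) offsets, and phase-unwraps
    so the result is not confined to [-pi, pi].
    """
    theta = np.arctan2(fwd[:, 1], fwd[:, 0])
    return np.unwrap(theta)

# Synthetic pigeon turning smoothly through 1.5 full revolutions.
angles = np.linspace(0.0, 3.0 * np.pi, 60)
fwd = np.column_stack([np.cos(angles), np.sin(angles), np.zeros_like(angles)])
heading = ground_heading(fwd)
```

Without the unwrap step, the heading would jump by 2π whenever the pigeon's direction crossed the ±π boundary, which would corrupt the first differences taken next.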
Since we wanted the velocities of that rotation and the translations in the ground plane x and y directions, first differences were taken for each of the quantities: ) ) (28) (29) (3) G θ ground (i) = G θ ground (i + 1) G θ ground (i) (31) G x O(B) (i) = G x O(B) (i + 1) G x O(B) (i) (32) G y O(B) (i) = G y O(B) (i + 1) G y O(B) (i) (33) G θ ground (i) was stored as the third rotational dof for the body segment. The final step was to convert the translational velocities to body centered representation by projecting the movements onto the forward and left facing vectors Gˆx (B) and G ŷ (B) of the body. This involved computing the magnitude of the velocities, computing the angles φ these velocities made, and projecting onto 12

20 each axis using the θ and φ angles as follows: vel(i) = ( G x O(B) (i)) 2 + ( G y O(B) (i)) 2 (34) ( ) G φ(i) = tan 1 y O(B) (i) G (35) x O(B) (i) trans1 = G x O(B) (i) = vel(i) cos( θ ground (i) φ(i)) (36) trans2 = G y O(B) (i) = vel(i) sin( θ ground (i) φ(i)) (37) These final two quantities, along with the z-component of G p O(B) (expected not to vary significantly throughout the motion and therefore not represented as a velocity), are the three translational dof stored for the body segment. 2.4 Data Normalization and Batch Preparation In the final representation there were 6 degrees of freedom (dof) per frame for each of the 4 segments, giving a total of 24 real values. Also, unlike [1], all translational dof were included to account for the articulated, multi-segment nature of the neck and legs. As a final step, after all velocities had been computed, the resulting data was scaled to have zero mean and unit variance for each of the 24 dof. This provides a smaller range centered at zero for the gaussian visible units to model. The data was re-scaled before measurement or playback. When a long sequence of motion was preprocessed, it was split into mini-batches of 1 frames for training. The order in which these batches were used during training was also randomized. These two techniques vastly improved the training results. The mini-batch learning provided more frequent updates than full batch learning and smoother updates than doing online gradient descent while the randomized order prevented the model from settling into a poor local minima based on the order of the sequences presented during training. After a new sequence was generated in the normalized space, all the preceding steps were carried out in reverse to get back to the global coordinates of the marker positions. The detailed steps have not been outlined here as this procedure is straightforward using the above information. 13
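The difference-and-project steps above can be sketched in a few lines of numpy. This is a minimal illustration rather than the thesis code; the function name is my own, and `arctan2` is used in place of $\tan^{-1}$ since it resolves the quadrant ambiguity:

```python
import numpy as np

def ground_plane_velocities(theta_ground, x, y):
    """First differences of ground-plane heading and position (Eqs. 31-33),
    with the translational velocity projected onto the body's forward and
    lateral axes (Eqs. 34-37)."""
    dtheta = np.diff(theta_ground)                     # Eq. 31
    dx, dy = np.diff(x), np.diff(y)                    # Eqs. 32-33
    vel = np.hypot(dx, dy)                             # Eq. 34: speed in the ground plane
    phi = np.arctan2(dy, dx)                           # Eq. 35: direction of travel
    trans1 = vel * np.cos(theta_ground[:-1] - phi)     # Eq. 36: forward component
    trans2 = vel * np.sin(theta_ground[:-1] - phi)     # Eq. 37: lateral component
    return dtheta, trans1, trans2
```

For a pigeon walking straight along its heading, `trans1` recovers the per-frame step length while `trans2` stays near zero.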

3 Motion Generation

3.1 Generation of Novel Motion

Both one and two layer models were trained using each of the available data sets to visually determine which generated sequences tended to most accurately represent pigeon behaviour. For each data set, every frame was analyzed to ensure the data was free of discontinuities. Due to the large torso of the pigeon, the foot markers are often occluded during motion capture, so this further analysis of every frame was necessary to ensure frames with occluded markers were not used in training. If they were used, the model would learn the large gaps in the dof values that occur when a marker is occluded, giving a poor, noisy representation of the pigeon motion.

Once trained, it was visually evident that two layer models produced better results than single layer models. The second layer tended to reduce noise in the generated sequences while adding more long-term variety to the walking pattern. This comparison between models is further quantified later in this report. Additionally, sequences which were split often due to occluded markers tended to produce poor models. With smaller sequences of continuous motion, the model was unable to learn the long term dependencies needed to generate good quality motion. One long sequence, however, was uninterrupted by occlusion and provided a great training set with a large variety of standing and walking in all directions.

Another aspect of the data tested heavily while training models was the frames per second (fps) at which the data was presented to the CRBM. Originally the data was 12 fps, but it was thought that limiting the data to 3 fps would result in more significant differences in each dof between frames, thus providing a better training signal for the model. This allowed CRBMs of order 3 to be used and to accurately reproduce motion (at 3 fps).
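The downsampling described above amounts to keeping every fourth frame. A trivial sketch, with the factor of 4 corresponding to reducing 12 fps data to 3 fps:

```python
import numpy as np

def downsample(frames, factor=4):
    """Keep every `factor`-th frame, e.g. factor=4 takes 12 fps data to 3 fps,
    enlarging the per-frame change in each dof."""
    return np.asarray(frames)[::factor]
```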
However, after much comparison between models of order 12 at 12 fps and models of order 3 at 3 fps, it was determined that the former produced better results. Using the single continuous data set, the training parameters were adjusted to find an optimal model. After tuning the parameters, the best model was found to be a Gaussian-binary CRBM trained on pigeon mocap data at 12 fps by following the procedure described in Section 1.3. Once this first layer was trained, a second binary-binary CRBM was trained on the real-valued probabilities of the hidden layer while driven by the training data. Each CRBM had 6 hidden units. The learning rates for all parameters were $1 \times 10^{-3}$, except for the autoregressive weights between visible units ($1 \times 10^{-5}$). During training, a momentum term was also used in which 0.9 of the previous accumulated gradient was added to the current gradient. Alternating Gibbs sampling was conducted for one step (i.e. CD-1) during training. Each of the two layers was conditioned on 12 previous frames.

Once trained, this CRBM was used to generate novel pigeon motion. Generation from a trained 2-layer model proceeds as follows: initialize with 24 frames of training data at the visible layer and perform a single up-pass to arrive at activations at the 1st hidden layer (H1), providing 12 frames of

H1 initialization. The current time step at H1 is then initialized with the previous frame plus some Gaussian noise. Since the hidden units are binary, the random noise is not added to the resulting activations, but instead added to the logit (the inverse of the sigmoid function), which is then passed through the sigmoid function to get updated activations. Alternating Gibbs sampling is then performed between the two hidden layers, conditioning on the 12 frames of H1 initialization. Gibbs sampling provides activations at H1 and H2. The former is used to perform a single down-pass to the visible layer using the weights of the 1st level CRBM, conditioning on the last 12 frames of visible initialization data. The above procedure is repeated, beginning with initialization of the next frame of H1 using the current activations plus a small amount of Gaussian noise.

Figure 9: Stroboscopic image from a walking pigeon. Figure reproduced from [15].

Figure 10: Example hold-phases present in generated motion.
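The logit-noise initialization described above can be sketched as follows. This is an illustrative reading of the procedure, not the thesis code; the noise scale `sigma` and the function names are my own assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p, eps=1e-6):
    """Inverse of the sigmoid, clipped away from 0 and 1 for numerical stability."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def init_h1_frame(h_prev, sigma=0.1, rng=None):
    """Initialize the current H1 frame from the previous one: Gaussian noise is
    added to the logit of the activations, then squashed back through the
    sigmoid, so the result remains a valid probability in (0, 1)."""
    rng = rng or np.random.default_rng(0)
    return sigmoid(logit(h_prev) + sigma * rng.normal(size=h_prev.shape))
```

Adding noise on the logit scale (rather than to the activations directly) guarantees the perturbed values can still be interpreted as Bernoulli probabilities for the binary hidden units.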

In order to carry out experiments that elicit responses in real pigeons from a virtual pigeon driven by our model, we needed to first determine whether our generated motion captured the subtle characteristics of pigeon motion. Fixed foot plants while walking and standing were among the subtleties present in generated human motion [1] that are also present in pigeon motion. There is an additional characteristic in pigeon motion that must also be modeled, namely the complex head-bobbing, which is defined by distinct alternations between a thrust-phase and a hold-phase. In the hold-phase, the head remains stationary in both position and rotation, while in the thrust-phase, the head quickly translates and rotates to the next hold-phase, as seen in a stroboscopic image of a live pigeon (Figure 9) [1]. In generating from our model, we could immediately see head motion that closely resembled this complex head-bobbing behaviour (Figure 10). The collection of head coordinate frames in this generated motion shows the distinct hold-phase, while the collection of foot coordinate frames demonstrates the concrete foot plants. For various videos of generated motion, see the accompanying website.

3.2 Quantifying Head Motion

Although the hold-phase was visually present in the generated motion, we sought a quantitative comparison to the real motion capture data. To quantify the hold-phase, we measured the head frame's yaw axis and the path length of the head in the horizontal plane with respect to the global coordinate system [1]. To assess the performance of our model, we also implemented a 3rd order autoregressive (AR) model fit by regularized least squares. Figure 11 shows a comparison between the training data, generated data from a 2-layer CRBM model, and the AR-3 baseline method. The second derivative of each line in the plot was calculated in order to detect the hold phases automatically.
Where this second derivative was at a maximum or a minimum corresponded to the levelling of the hold-phase regions (grey areas in Figure 11). Within these regions the standard deviations of each measurement were calculated, and the mean over all the regions within each plot was computed. These calculations were also done for the thrust-phase regions of the plots and are shown, along with the hold-phase results, in Table 2.

Table 2: Comparison of Mean Standard Deviation in Hold and Thrust Phases (path length in mm and yaw angle in degrees, for the training data, the 2-layer CRBM, and the AR(3) model, each in its hold and thrust phases).

Figure 11: Hold-phases present in path length and yaw angles (shaded grey), comparing the training data, 2-layer CRBM generation, and AR(12) generation.

As seen in Figure 11, the training data displayed a step-like pattern in both the yaw angle and path length plots, where the flat portions of each plot represent the hold-phase of the head motion. Similarly, the data generated from the 2-layer CRBM model clearly exhibited this behaviour. This is also evident from the large difference in standard deviations in Table 2. Due to the stochastic nature of the CRBM model, the standard deviations in the hold phase of the generated motion were greater than those of the original motion capture data. Since the CRBMs used here model ground-plane velocities instead of positions, any noise introduced by sampling was integrated throughout the output sequence during post-processing.

Another aspect to note about the generated data is the smaller standard deviations of the hold-phase for rotations compared to those for translations. This is likely due to the normalization step done during pre-processing of the data. The translations are typically larger numbers (expressed in mm) than the rotations (expressed in exponential map representation), so any comparable noise is magnified in the translational dimensions when the data is re-scaled. One way to combat this would be to adjust the precision of the Gaussian units, $1/\sigma_i$. From previous experience (see [1]), the precision of the Gaussian units works well when fixed at 1, which was done throughout our experiments. However, adjusting this parameter to a larger fixed value, or learning it during training, could improve the results.
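A simplified variant of this hold-phase analysis can be sketched as follows. Instead of locating the second-derivative extrema used in the thesis, this sketch labels hold frames by thresholding the per-frame change (the `frac` threshold is my own assumption), splits them into contiguous regions, and averages the within-region standard deviation:

```python
import numpy as np

def split_runs(mask):
    """Index ranges of contiguous True runs in a boolean mask."""
    runs, start = [], None
    for i, m in enumerate(list(mask) + [False]):
        if m and start is None:
            start = i
        elif not m and start is not None:
            runs.append(range(start, i))
            start = None
    return runs

def phase_stds(signal, frac=0.25):
    """Mean within-region standard deviation of `signal` in hold regions
    (small per-frame change) versus thrust regions (large per-frame change)."""
    signal = np.asarray(signal, dtype=float)
    speed = np.abs(np.diff(signal))
    hold = speed <= frac * speed.max()
    mean_std = lambda runs: float(np.mean([signal[list(r)].std() for r in runs]))
    return mean_std(split_runs(hold)), mean_std(split_runs(~hold))
```

On a step-like trace such as the training-data path length, the hold-phase standard deviation comes out much smaller than the thrust-phase one, matching the qualitative pattern reported in Table 2.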

4 Prediction of Missing Foot Markers

4.1 Marker Occlusion

Due to the fixed positions of the motion capture cameras, the feet of the pigeons were often occluded by the feathers and the relatively large torso of the pigeon. Frames with missing foot data could be ignored when training, but that limits the number of good frames with which we can train. As discussed earlier, training on data without the occluded frames generally gave poor results due to the shorter continuous sequence lengths. Thus it became a goal to remedy this marker occlusion problem.

One method to solve this problem, as shown in Figure 12, is to predict the 6 degrees of freedom of a missing foot (shown with black background) based on the 18 remaining dof from the other foot, the head, and the body, as well as the previous visible vectors and the current hidden units. This was done using two methods: alternating Gibbs sampling, much like the generation algorithm, or by minimizing the free energy over the units. The latter method was only used for 1-layer models, as it should provide an optimal result regardless of additional layers. More details of these methods appear in the following two sections, but one other consideration was made before either procedure was used.

Figure 12: 1-layer CRBM showing the prediction problem.

A special normalization technique was used before the prediction took place, due to the presence of bad frames in the data. If the entire data set (containing both good and bad frames) were to be used for normalization, then the calculated mean and standard deviation of the data would have been skewed heavily by the bad frames. When markers were occluded during capture, they were automatically set by the motion capture system to (0, 0, 0) in global coordinates. This caused a large discontinuity in the data between good and bad frames. Therefore, to normalize properly, all frames with missing marker data were removed.
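The occlusion-aware normalization above can be sketched as below, under two assumptions of mine: the data is arranged as frames by dof with coordinates in x, y, z triples, and an occluded marker is flagged as exactly (0, 0, 0):

```python
import numpy as np

def normalize_skipping_occlusions(data):
    """Z-score the data using statistics computed only from frames in which
    no marker sits exactly at the (0, 0, 0) occlusion flag."""
    markers = data.reshape(len(data), -1, 3)
    # Per frame: does any marker triple equal exactly (0, 0, 0)?
    occluded = np.all(markers == 0.0, axis=2).any(axis=1)
    good = data[~occluded]
    mu, sigma = good.mean(axis=0), good.std(axis=0)
    return (data - mu) / sigma, ~occluded
```

Computing `mu` and `sigma` only over the good frames avoids the heavy skew that the large (0, 0, 0) discontinuities would otherwise introduce.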
Although this decreased the amount of valid data used for normalization, it prevented skewing of the normalized values and thus was used throughout the prediction experiments.

4.2 Method 1: Alternating Gibbs Sampling

The first of the two methods for predicting missing information used alternating Gibbs sampling. This method was identical to 1-step generation, except that only the missing markers were reconstructed during the alternating Gibbs sampling. To begin, the unknown markers were initialized to their values at the previous frame plus a small amount of Gaussian noise. The alternating Gibbs sampling then proceeded for 3 iterations, utilizing all the connections of the CRBM.

For a 1-layer model, this sampling updated all the hidden units using the initialization of the missing information in the current frame, as well as the real data for the remaining dof of the current frame and all dof of the previous frames. Since the CRBM models were found to give best

results at order 12, at least 12 good frames at the beginning of a gap were needed to initialize the 1-layer model. On each downward pass of the Gibbs sampling, only the missing dimensions were filled in, providing a reasonable ratio between the valid data to condition on and the missing data to be filled in.

For a 2-layer model, we used a slightly different procedure than the traditional Gibbs sampling used for 2-layer CRBM motion generation discussed in Section 3.1. Aside from the obvious change of only updating the missing dof, the alternating Gibbs sampling used for prediction implemented a linear blending between the visible and hidden layers. As before, the first hidden layer at the current time step was initialized to the values from the previous time step plus Gaussian noise added to the logits. The next step consisted of simultaneously conducting an upward pass to update the second hidden layer in parallel as well as updating the missing visible dof with a downward pass. Then, the second step of the alternating Gibbs sampling blended a downward pass from the second hidden layer with an upward pass from the visibles to get activations for the first hidden layer. The ratio of blending was controlled by a parameter, λ, which scaled the H2 inputs while (1 − λ) scaled the visible inputs. Thus, setting λ = 0 gave a 1-layer representation, while setting λ = 1 limited the Gibbs sampling to be only between the hidden layers before doing a single downward pass. In practice, setting λ = 0.1 worked best (see Figure 13). This is likely due to the heavier reliance on the visible information at both the current and previous time steps. Since the goal of this procedure was to fill in some dof of this visible data, knowledge of the remaining dof at the current time step had a much greater impact on prediction performance than the long-term information contributed by the second hidden layer.
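The λ-blended update of the first hidden layer can be sketched as below. The weight shapes are my own assumptions (W1 maps visibles to H1, W2 maps H1 to H2), and the conditioning terms from past frames are folded into the bias for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def blended_h1_update(v, h2, W1, W2, b1, lam=0.1):
    """Blend the upward pass from the visibles with the downward pass from H2:
    lam scales the H2 drive, (1 - lam) the visible drive. lam = 0 reduces to a
    1-layer update; lam = 1 ignores the current visibles entirely."""
    bottom_up = v @ W1          # visible -> H1
    top_down = h2 @ W2.T        # H2 -> H1
    return sigmoid((1.0 - lam) * bottom_up + lam * top_down + b1)
```

With lam = 0.1, the update leans heavily on the partially observed visible frame, which is exactly the information the prediction task is trying to respect.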
Figure 13: Effect of changing the ratio of visible influence, λ, on the average prediction error (mm).

In all the cases above, regardless of the λ value, the order of the model, or how many layers were used, the alternating Gibbs sampling was subject to noise. This noise was the reason why we looked for an alternative procedure for filling in the missing marker data, one which marginalizes out the hidden units.

4.3 Method 2: Free-Energy Minimization

Since the hidden variables are binary, they can be integrated out, giving the free energy formulation of the system given the past observations and model parameters. In the expression (Eq. 7)


More information

An object in 3D space

An object in 3D space An object in 3D space An object's viewpoint Every Alice object has a viewpoint. The viewpoint of an object is determined by: The position of the object in 3D space. The orientation of the object relative

More information

Motion Capture & Simulation

Motion Capture & Simulation Motion Capture & Simulation Motion Capture Character Reconstructions Joint Angles Need 3 points to compute a rigid body coordinate frame 1 st point gives 3D translation, 2 nd point gives 2 angles, 3 rd

More information

10703 Deep Reinforcement Learning and Control

10703 Deep Reinforcement Learning and Control 10703 Deep Reinforcement Learning and Control Russ Salakhutdinov Machine Learning Department rsalakhu@cs.cmu.edu Policy Gradient I Used Materials Disclaimer: Much of the material and slides for this lecture

More information

Hartley - Zisserman reading club. Part I: Hartley and Zisserman Appendix 6: Part II: Zhengyou Zhang: Presented by Daniel Fontijne

Hartley - Zisserman reading club. Part I: Hartley and Zisserman Appendix 6: Part II: Zhengyou Zhang: Presented by Daniel Fontijne Hartley - Zisserman reading club Part I: Hartley and Zisserman Appendix 6: Iterative estimation methods Part II: Zhengyou Zhang: A Flexible New Technique for Camera Calibration Presented by Daniel Fontijne

More information

A Short SVM (Support Vector Machine) Tutorial

A Short SVM (Support Vector Machine) Tutorial A Short SVM (Support Vector Machine) Tutorial j.p.lewis CGIT Lab / IMSC U. Southern California version 0.zz dec 004 This tutorial assumes you are familiar with linear algebra and equality-constrained optimization/lagrange

More information

NEURAL NETWORK VISUALIZATION

NEURAL NETWORK VISUALIZATION Neural Network Visualization 465 NEURAL NETWORK VISUALIZATION Jakub Wejchert Gerald Tesauro IB M Research T.J. Watson Research Center Yorktown Heights NY 10598 ABSTRACT We have developed graphics to visualize

More information

CS231A Course Notes 4: Stereo Systems and Structure from Motion

CS231A Course Notes 4: Stereo Systems and Structure from Motion CS231A Course Notes 4: Stereo Systems and Structure from Motion Kenji Hata and Silvio Savarese 1 Introduction In the previous notes, we covered how adding additional viewpoints of a scene can greatly enhance

More information

Fast Automated Estimation of Variance in Discrete Quantitative Stochastic Simulation

Fast Automated Estimation of Variance in Discrete Quantitative Stochastic Simulation Fast Automated Estimation of Variance in Discrete Quantitative Stochastic Simulation November 2010 Nelson Shaw njd50@uclive.ac.nz Department of Computer Science and Software Engineering University of Canterbury,

More information

CHAPTER 5 MOTION DETECTION AND ANALYSIS

CHAPTER 5 MOTION DETECTION AND ANALYSIS CHAPTER 5 MOTION DETECTION AND ANALYSIS 5.1. Introduction: Motion processing is gaining an intense attention from the researchers with the progress in motion studies and processing competence. A series

More information

CS 231A Computer Vision (Fall 2012) Problem Set 3

CS 231A Computer Vision (Fall 2012) Problem Set 3 CS 231A Computer Vision (Fall 2012) Problem Set 3 Due: Nov. 13 th, 2012 (2:15pm) 1 Probabilistic Recursion for Tracking (20 points) In this problem you will derive a method for tracking a point of interest

More information

Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient. Ali Mirzapour Paper Presentation - Deep Learning March 7 th

Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient. Ali Mirzapour Paper Presentation - Deep Learning March 7 th Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient Ali Mirzapour Paper Presentation - Deep Learning March 7 th 1 Outline of the Presentation Restricted Boltzmann Machine

More information

Integers & Absolute Value Properties of Addition Add Integers Subtract Integers. Add & Subtract Like Fractions Add & Subtract Unlike Fractions

Integers & Absolute Value Properties of Addition Add Integers Subtract Integers. Add & Subtract Like Fractions Add & Subtract Unlike Fractions Unit 1: Rational Numbers & Exponents M07.A-N & M08.A-N, M08.B-E Essential Questions Standards Content Skills Vocabulary What happens when you add, subtract, multiply and divide integers? What happens when

More information

Graph-based High Level Motion Segmentation using Normalized Cuts

Graph-based High Level Motion Segmentation using Normalized Cuts Graph-based High Level Motion Segmentation using Normalized Cuts Sungju Yun, Anjin Park and Keechul Jung Abstract Motion capture devices have been utilized in producing several contents, such as movies

More information

Radial Basis Function Networks: Algorithms

Radial Basis Function Networks: Algorithms Radial Basis Function Networks: Algorithms Neural Computation : Lecture 14 John A. Bullinaria, 2015 1. The RBF Mapping 2. The RBF Network Architecture 3. Computational Power of RBF Networks 4. Training

More information

Rectification and Distortion Correction

Rectification and Distortion Correction Rectification and Distortion Correction Hagen Spies March 12, 2003 Computer Vision Laboratory Department of Electrical Engineering Linköping University, Sweden Contents Distortion Correction Rectification

More information

Revision of the SolidWorks Variable Pressure Simulation Tutorial J.E. Akin, Rice University, Mechanical Engineering. Introduction

Revision of the SolidWorks Variable Pressure Simulation Tutorial J.E. Akin, Rice University, Mechanical Engineering. Introduction Revision of the SolidWorks Variable Pressure Simulation Tutorial J.E. Akin, Rice University, Mechanical Engineering Introduction A SolidWorks simulation tutorial is just intended to illustrate where to

More information

Chapter 4. The Classification of Species and Colors of Finished Wooden Parts Using RBFNs

Chapter 4. The Classification of Species and Colors of Finished Wooden Parts Using RBFNs Chapter 4. The Classification of Species and Colors of Finished Wooden Parts Using RBFNs 4.1 Introduction In Chapter 1, an introduction was given to the species and color classification problem of kitchen

More information

CS 664 Segmentation. Daniel Huttenlocher

CS 664 Segmentation. Daniel Huttenlocher CS 664 Segmentation Daniel Huttenlocher Grouping Perceptual Organization Structural relationships between tokens Parallelism, symmetry, alignment Similarity of token properties Often strong psychophysical

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

Geometry and Gravitation

Geometry and Gravitation Chapter 15 Geometry and Gravitation 15.1 Introduction to Geometry Geometry is one of the oldest branches of mathematics, competing with number theory for historical primacy. Like all good science, its

More information

Autoencoders, denoising autoencoders, and learning deep networks

Autoencoders, denoising autoencoders, and learning deep networks 4 th CiFAR Summer School on Learning and Vision in Biology and Engineering Toronto, August 5-9 2008 Autoencoders, denoising autoencoders, and learning deep networks Part II joint work with Hugo Larochelle,

More information

An Evolutionary Approximation to Contrastive Divergence in Convolutional Restricted Boltzmann Machines

An Evolutionary Approximation to Contrastive Divergence in Convolutional Restricted Boltzmann Machines Wright State University CORE Scholar Browse all Theses and Dissertations Theses and Dissertations 2014 An Evolutionary Approximation to Contrastive Divergence in Convolutional Restricted Boltzmann Machines

More information

FAB verses tradition camera-based motion capture systems

FAB verses tradition camera-based motion capture systems FAB verses tradition camera-based motion capture systems The advent of micromachined inertial sensors, such as rate gyroscopes and accelerometers, has made new navigation and tracking technologies possible.

More information

Visual Recognition: Image Formation

Visual Recognition: Image Formation Visual Recognition: Image Formation Raquel Urtasun TTI Chicago Jan 5, 2012 Raquel Urtasun (TTI-C) Visual Recognition Jan 5, 2012 1 / 61 Today s lecture... Fundamentals of image formation You should know

More information

8 th Grade Pre Algebra Pacing Guide 1 st Nine Weeks

8 th Grade Pre Algebra Pacing Guide 1 st Nine Weeks 8 th Grade Pre Algebra Pacing Guide 1 st Nine Weeks MS Objective CCSS Standard I Can Statements Included in MS Framework + Included in Phase 1 infusion Included in Phase 2 infusion 1a. Define, classify,

More information

CS 223B Computer Vision Problem Set 3

CS 223B Computer Vision Problem Set 3 CS 223B Computer Vision Problem Set 3 Due: Feb. 22 nd, 2011 1 Probabilistic Recursion for Tracking In this problem you will derive a method for tracking a point of interest through a sequence of images.

More information

MINI-PAPER A Gentle Introduction to the Analysis of Sequential Data

MINI-PAPER A Gentle Introduction to the Analysis of Sequential Data MINI-PAPER by Rong Pan, Ph.D., Assistant Professor of Industrial Engineering, Arizona State University We, applied statisticians and manufacturing engineers, often need to deal with sequential data, which

More information

Lecture 13. Deep Belief Networks. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen

Lecture 13. Deep Belief Networks. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen Lecture 13 Deep Belief Networks Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen IBM T.J. Watson Research Center Yorktown Heights, New York, USA {picheny,bhuvana,stanchen}@us.ibm.com 12 December 2012

More information

Recurrent Neural Network (RNN) Industrial AI Lab.

Recurrent Neural Network (RNN) Industrial AI Lab. Recurrent Neural Network (RNN) Industrial AI Lab. For example (Deterministic) Time Series Data Closed- form Linear difference equation (LDE) and initial condition High order LDEs 2 (Stochastic) Time Series

More information

An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation

An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio Université de Montréal 13/06/2007

More information

Improving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah

Improving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah Improving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah Reference Most of the slides are taken from the third chapter of the online book by Michael Nielson: neuralnetworksanddeeplearning.com

More information

Experiments with Edge Detection using One-dimensional Surface Fitting

Experiments with Edge Detection using One-dimensional Surface Fitting Experiments with Edge Detection using One-dimensional Surface Fitting Gabor Terei, Jorge Luis Nunes e Silva Brito The Ohio State University, Department of Geodetic Science and Surveying 1958 Neil Avenue,

More information

Numenta Node Algorithms Guide NuPIC 1.7

Numenta Node Algorithms Guide NuPIC 1.7 1 NuPIC 1.7 includes early implementations of the second generation of the Numenta HTM learning algorithms. These algorithms are available as two node types: SpatialPoolerNode and TemporalPoolerNode. This

More information

Recent Developments in Model-based Derivative-free Optimization

Recent Developments in Model-based Derivative-free Optimization Recent Developments in Model-based Derivative-free Optimization Seppo Pulkkinen April 23, 2010 Introduction Problem definition The problem we are considering is a nonlinear optimization problem with constraints:

More information

Chapter 1: Introduction

Chapter 1: Introduction Chapter 1: Introduction This dissertation will describe the mathematical modeling and development of an innovative, three degree-of-freedom robotic manipulator. The new device, which has been named the

More information

Supplementary Materials for DVQA: Understanding Data Visualizations via Question Answering

Supplementary Materials for DVQA: Understanding Data Visualizations via Question Answering Supplementary Materials for DVQA: Understanding Data Visualizations via Question Answering Kushal Kafle 1, Brian Price 2 Scott Cohen 2 Christopher Kanan 1 1 Rochester Institute of Technology 2 Adobe Research

More information

8 th Grade Mathematics Unpacked Content For the new Common Core standards that will be effective in all North Carolina schools in the

8 th Grade Mathematics Unpacked Content For the new Common Core standards that will be effective in all North Carolina schools in the 8 th Grade Mathematics Unpacked Content For the new Common Core standards that will be effective in all North Carolina schools in the 2012-13. This document is designed to help North Carolina educators

More information

Supplementary Information. Design of Hierarchical Structures for Synchronized Deformations

Supplementary Information. Design of Hierarchical Structures for Synchronized Deformations Supplementary Information Design of Hierarchical Structures for Synchronized Deformations Hamed Seifi 1, Anooshe Rezaee Javan 1, Arash Ghaedizadeh 1, Jianhu Shen 1, Shanqing Xu 1, and Yi Min Xie 1,2,*

More information

Occluded Facial Expression Tracking

Occluded Facial Expression Tracking Occluded Facial Expression Tracking Hugo Mercier 1, Julien Peyras 2, and Patrice Dalle 1 1 Institut de Recherche en Informatique de Toulouse 118, route de Narbonne, F-31062 Toulouse Cedex 9 2 Dipartimento

More information

Logistic Regression and Gradient Ascent

Logistic Regression and Gradient Ascent Logistic Regression and Gradient Ascent CS 349-02 (Machine Learning) April 0, 207 The perceptron algorithm has a couple of issues: () the predictions have no probabilistic interpretation or confidence

More information

Using Subspace Constraints to Improve Feature Tracking Presented by Bryan Poling. Based on work by Bryan Poling, Gilad Lerman, and Arthur Szlam

Using Subspace Constraints to Improve Feature Tracking Presented by Bryan Poling. Based on work by Bryan Poling, Gilad Lerman, and Arthur Szlam Presented by Based on work by, Gilad Lerman, and Arthur Szlam What is Tracking? Broad Definition Tracking, or Object tracking, is a general term for following some thing through multiple frames of a video

More information

Keywords: clustering, construction, machine vision

Keywords: clustering, construction, machine vision CS4758: Robot Construction Worker Alycia Gailey, biomedical engineering, graduate student: asg47@cornell.edu Alex Slover, computer science, junior: ais46@cornell.edu Abstract: Progress has been made in

More information

10-701/15-781, Fall 2006, Final

10-701/15-781, Fall 2006, Final -7/-78, Fall 6, Final Dec, :pm-8:pm There are 9 questions in this exam ( pages including this cover sheet). If you need more room to work out your answer to a question, use the back of the page and clearly

More information

/10/$ IEEE 4048

/10/$ IEEE 4048 21 IEEE International onference on Robotics and Automation Anchorage onvention District May 3-8, 21, Anchorage, Alaska, USA 978-1-4244-54-4/1/$26. 21 IEEE 448 Fig. 2: Example keyframes of the teabox object.

More information

Ray Tracing through Viewing Portals

Ray Tracing through Viewing Portals Ray Tracing through Viewing Portals Introduction Chris Young Igor Stolarsky April 23, 2008 This paper presents a method for ray tracing scenes containing viewing portals circular planes that act as windows

More information

Artificial Neural Network-Based Prediction of Human Posture

Artificial Neural Network-Based Prediction of Human Posture Artificial Neural Network-Based Prediction of Human Posture Abstract The use of an artificial neural network (ANN) in many practical complicated problems encourages its implementation in the digital human

More information

An Efficient Method for Solving the Direct Kinematics of Parallel Manipulators Following a Trajectory

An Efficient Method for Solving the Direct Kinematics of Parallel Manipulators Following a Trajectory An Efficient Method for Solving the Direct Kinematics of Parallel Manipulators Following a Trajectory Roshdy Foaad Abo-Shanab Kafr Elsheikh University/Department of Mechanical Engineering, Kafr Elsheikh,

More information

Irradiance Gradients. Media & Occlusions

Irradiance Gradients. Media & Occlusions Irradiance Gradients in the Presence of Media & Occlusions Wojciech Jarosz in collaboration with Matthias Zwicker and Henrik Wann Jensen University of California, San Diego June 23, 2008 Wojciech Jarosz

More information

A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models

A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models Gleidson Pegoretti da Silva, Masaki Nakagawa Department of Computer and Information Sciences Tokyo University

More information

Slides credited from Dr. David Silver & Hung-Yi Lee

Slides credited from Dr. David Silver & Hung-Yi Lee Slides credited from Dr. David Silver & Hung-Yi Lee Review Reinforcement Learning 2 Reinforcement Learning RL is a general purpose framework for decision making RL is for an agent with the capacity to

More information

Logical Templates for Feature Extraction in Fingerprint Images

Logical Templates for Feature Extraction in Fingerprint Images Logical Templates for Feature Extraction in Fingerprint Images Bir Bhanu, Michael Boshra and Xuejun Tan Center for Research in Intelligent Systems University of Califomia, Riverside, CA 9252 1, USA Email:

More information