Weighted Finite Automatas using Spectral Methods for Computer Vision


A Thesis Presented by Zulqarnain Qayyum Khan to The Department of Electrical and Computer Engineering in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer Engineering

Northeastern University
Boston, Massachusetts
April 2016

To Abbu Jaan, wish you were still here!

Contents

List of Figures
List of Tables
Acknowledgments
Abstract of the Thesis

Introduction
  Background
  Problem Statement
  Related Work
  Overview

Weighted Finite Automatas
  Introduction
  Definition
  Transformations
  WFA Hankels
  Spectral Learning
    Empirical Hankel
    Recovering WFA

Pre-Processing
  Introduction
  Posebits
    Posebit Selection
  Clusters of Velocities and Acceleration
  Hankel Matrices
    Clustering Gram Matrices

Synthetic Experiments
  WFA Generation
  String Generation from WFAs
  Evaluation Functions, Empirical Hankels, and Spectral Learning
    Evaluation Functions
    Empirical Hankels
    Spectral Learning
  Experiments to Evaluate Estimated WFAs
    Frobenius Norm
    Perplexity
    K-L Divergence
    Word Prediction Error Rate

Experiments
  Experimental Setup
  MHAD Dataset
    Description
    Evaluation
  MSR3D Dataset
    Description
    Evaluation
  Composable Activities Dataset
    Description
    Evaluation
  HDM05 Dataset
    Description
    Evaluation
  UTKinect Dataset
    Description
    Evaluation
  Some Other Experiments
    Experiments on PbDb Dataset
    Experiments with Hankels

Conclusion and Future Work
Bibliography

List of Figures

Graphical Representation of a WFA
The General Flow sheet of the learning algorithm
Examples of posebits, and some poses conditioned on different posebits
Different relationships in body parts
Posebit Binary Tree
Description of structure of a Hankel matrix
Snapshots from MHAD Dataset
Snapshots of the throwing action from MHAD Dataset
Confusion matrix for MHAD Dataset with s=90%
Confusion matrix for MHAD Dataset with s=60%
Confusion matrix for MHAD Dataset with s=99%
Confusion matrix for MSR3D Dataset with s=90%
Confusion matrix for MSR3D Dataset with s=95%
Confusion matrix for MSR3D Dataset with s=75%
Snapshots from Composable Activities Dataset
Confusion matrix for Composable Activities Dataset with s=95%
Confusion matrix for Composable Activities Dataset with s=99%
Confusion matrix for Composable Activities Dataset with s=75%
Confusion matrix for Composable Activities Dataset from [38]
Confusion matrix for HDM05 Dataset with s=94%
Confusion matrix for HDM05 Dataset with s=99%
Confusion matrix for HDM05 Dataset with s=85%
Confusion matrix for HDM05 Dataset following protocol of [2]
Snapshots from UTKinect Dataset
Confusion matrix for UTKinect Dataset with s=95%
Confusion matrix for UTKinect Dataset with s=99%
Confusion matrix for UTKinect Dataset with s=75%
Scores when WFA is trained on walk
Scores when WFA is trained on jogging
Scores when WFA is trained on boxing

List of Tables

Perplexity comparison for estimated WFAs
KLD comparison for estimated WFAs
Comparison of accuracies with other methods on MHAD Dataset
Comparison of accuracies with other methods on UTKinect Dataset

Acknowledgments

Here I wish to thank everyone who has supported me during the thesis work, especially Prof. Camps for advising and supervising my work, and Prof. Sznaier and Prof. Dy for agreeing to be on my thesis committee. I would also like to thank my lab fellows, especially Caglayan and Xikang, for guiding and helping me. Last but not least, I'd like to acknowledge the support of my family back home and my support base here in Boston, my Minions.

Abstract of the Thesis

By Zulqarnain Qayyum Khan
Master of Science in Electrical and Computer Engineering
Northeastern University, April 2016
Dr. Octavia I. Camps, Adviser

There are many possible ways to model the machine that generates a set of sequences. Weighted Finite Automatas (WFAs) have been demonstrated to be a powerful tool in this regard by the Natural Language Processing community. Spectral techniques for recovering WFAs from empirically constructed Hankel matrices have also been demonstrated to work very well, with theoretical backing, and thus make the task of recovering the underlying machine feasible. Our focus here is to port WFAs and the spectral recovery techniques to the field of Computer Vision, implementing every technique from scratch to gain a more in-depth understanding. More specifically, we look at activity videos (simple and complex) as string sequences, where the goal is to recover the underlying machines that generate similar activities. Different features are used to convert the videos into strings, and spectral methods are then applied to demonstrate the viability of WFAs in tasks such as Action Classification on multiple datasets. The results are encouraging but indicate that further refinement of the approach and more data are needed.

Chapter 1
Introduction

1.1. Background

Recognizing, classifying, or segmenting sequences plays a major role in any field that deals with pattern recognition, be it text-based Natural Language Processing or image-based Computer Vision. There are multiple ways to identify sequences. One is to differentiate between sequences based on appearance, motion, or other features. Another, explored in this thesis, is to model the underlying system that generates the sequences instead of comparing the sequences directly. For example, [1] and [2] identify and compare the dynamical systems that generate activities and then use different metrics for the task of activity recognition. Other approaches can be broadly classified as generative or discriminative. Generative approaches include HMM-based modelling, which dates back as early as [3] and extends to more recent work such as [4]. Discriminative models, which have been in more use recently, include SVMs and Artificial Neural Networks (ANNs), as utilized by [53][23][30][54]. Keeping these in mind, along with the work done by Borja Balle et al. [5] in the Natural Language Processing community, the intention is to introduce another generative model, namely Weighted Finite Automatas, to the Computer Vision community.

1.2. Problem Statement

We start with a from-scratch implementation of [5], to develop a more in-depth understanding of how Weighted Finite Automatas work and to make it easier to adapt them to our specific tasks. The next step is to test the implementation on synthetic data. For this we need a synthetic WFA generator, as well as a generator that produces strings from WFAs. After testing the discriminative ability of the WFAs on synthetic examples and arriving at a satisfactory implementation, we move on to applying WFAs and the associated spectral techniques to the Computer Vision tasks of Activity Recognition and Action Segmentation. The goal is to demonstrate the usability of WFAs in the community and to provide them as a tool. To use WFAs, activity recognition videos need to be pre-processed in different ways, which is also tackled here; a related issue is what kind of videos to use. For now we deal with videos that provide skeletal joint locations.

1.3. Related Work

The body of work related to this thesis can be broadly divided into two areas, which are touched upon separately below.

Weighted Finite Automatas: To a large extent this is the main focus of the thesis, implementing and following the lead of [5], who in turn are motivated by more detailed work on automata, like [6] on spectral learning and Quadratic Weighted Automata. Fundamental work on automata and the theorems that form the backbone of this work can be found in [7].

Activity Recognition: Activity recognition is one of the most intuitive and commonly tackled problems in Computer Vision, yet it remains one of the most complicated. This interest and complexity have spawned a number of ways to attack the problem. The approaches vary inherently as well as with the kind of data they deal with: some are more efficient on data that has skeletal joint information, some deal with dynamics, and yet others are motivated by appearance-based features. The list of work in the area is extensive, and for brevity we only point to approaches that differ from each other, to give the reader an idea of the work that has been done.

Recent work includes approaches based on grammars, such as those using segmental grammars to parse videos. For example, [8] uses a latent structural SVM to train grammar parameters, learning the hidden sub-actions in the process. Other similar approaches make use of Context Free Grammars (CFGs), such as [9][10][11][12]. Looking at actions as a set of sub-actions is natural and intuitive and hence often used, for example by [13][14], which use decomposable motion segments and learn temporal structures for the task. Yet another way is to make use of spatio-temporal features such as optical flow [15][16][17] and Bag of Features [18]. Longer video sequences that contain multiple activities tend to be handled by probabilistic models, from earlier Finite State Machines [19][20] and Hidden Markov Models (HMMs) [4] to more recent models such as Conditional Random Fields (CRFs) [18]. Further variability in the length of video sequences is tackled by approaches such as Hierarchical HMMs [21][22][23] or segmental HMMs [24][25][26][27]. A very different way of approaching the problem is to assume that there are underlying systems that generate a particular activity, and then to use Hankelets and dynamical distance metrics to identify those systems indirectly, as done by [1] and [2].

1.4. Overview

The thesis is divided into chapters dealing individually with the different steps and methods involved. Chapter 2 is an in-depth discussion of Weighted Finite Automatas, their implementation, their generation, and the spectral techniques used to recover them. Chapter 3 deals with the pre-processing step, that is, how to convert available videos into strings that can be processed by the WFAs. Chapter 4 explains our implementation of the whole process and the synthetic experiments done to establish confidence in the method going forward. Finally, Chapter 5 provides results on multiple real-world datasets with skeletal joint information. This is followed by a conclusion and a brief discussion of future work in this direction.

Chapter 2
Weighted Finite Automatas

2.1. Introduction

Weighted Finite Automatas (WFAs), also referred to as Observable Operator Models (OOMs) [28], are a generalization of HMMs [29][28]. WFAs can be viewed as a more expressive form of HMM, with the advantage that this expressiveness does not come at the cost of increased complexity in learning; in fact, as [28] points out, they are often easier to learn. WFAs, like HMMs, are inherently random models and hence are best suited to model systems that are intrinsically random themselves. Moreover, WFAs can be probabilistic as well as non-probabilistic. Keeping this in mind, we now move on to a formal definition of WFAs.

2.2. Definition

From an application point of view, WFAs are functions that map strings to real numbers. More formally, as defined in [5], a WFA $W$ with $n$ states is completely defined by the tuple $W = \langle \alpha_1, \alpha_\infty, \{A_\sigma\} \rangle$ over a set of symbols $\Sigma$, where

- $\alpha_1 \in \mathbb{R}^n$ is the initial state probability vector
- $\alpha_\infty \in \mathbb{R}^n$ is the termination probability vector
- $A_\sigma \in \mathbb{R}^{n \times n}$ are the transition probability matrices, one for each symbol $\sigma \in \Sigma$
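As a small illustration of this definition, the sketch below stores a WFA as plain Python lists and scores a string by passing the initial vector through one transition matrix per symbol and then taking the inner product with the termination vector (the product that equation (1) formalizes). The function names `vec_mat` and `wfa_score` and the 2-state weights are illustrative assumptions; they are not the values of Figure 1 and not the thesis code.

```python
def vec_mat(v, M):
    """Row vector times matrix, both given as plain Python lists."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def wfa_score(alpha1, alpha_inf, A, x):
    """Score of string x: alpha1^T * A[x[0]] * ... * A[x[-1]] * alpha_inf."""
    v = list(alpha1)
    for symbol in x:
        v = vec_mat(v, A[symbol])
    return sum(vi * ti for vi, ti in zip(v, alpha_inf))


# Hypothetical 2-state WFA over the symbols {a, b}; the numbers are made up.
alpha1 = [1.0, 0.0]                      # initial state vector
alpha_inf = [0.2, 0.3]                   # termination vector
A = {
    "a": [[0.2, 0.3], [0.1, 0.1]],       # transition matrix for symbol a
    "b": [[0.1, 0.2], [0.3, 0.2]],       # transition matrix for symbol b
}

score_aba = wfa_score(alpha1, alpha_inf, A, "aba")
```

The empty string is scored by the inner product of the initial and termination vectors alone, since no transition matrix is applied.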

Figure 1. (a) Graphical representation of a WFA with 2 states ($n = 2$) and $\Sigma = \{a, b\}$. (b) Operator or matrix representation of the same WFA.

Given this form of WFA and a string $x$, the WFA can be used to model the probability (or score) of the string being generated from the WFA $W$, as follows:

$$f(x) = \alpha_1^T A_x \alpha_\infty \qquad (1)$$

where $A_x$ is the product of all the matrices associated with the symbols in the string $x$. For example, given the WFA of Figure 1 and a string $x = aba$, we have

$$f(x) = f(aba) = \alpha_1^T A_a A_b A_a \alpha_\infty \qquad (2)$$

Since this is not a probabilistic WFA, a higher score indicates a higher likelihood of the string coming from this WFA.

2.3. Transformations

Another useful characteristic of a WFA is its ability to model different scoring functions through equivalent transformations. Two transformations, pointed out by both [5] and [28], make it easier to manipulate and use WFAs. Given the WFA $W$ as defined in the previous section, it is possible to transform it into two equivalent WFAs, $W_s$ and $W_p$, by using the following transformations.

Transformation 1: Given a WFA $W = \langle \alpha_1, \alpha_\infty, \{A_\sigma\} \rangle$, it can be transformed into $W_s = \langle \alpha_{1s}, \alpha_{\infty s}, \{A_\sigma\} \rangle$, where

$$X = \sum_{\sigma \in \Sigma} A_\sigma, \qquad \alpha_{1s}^T = \alpha_1^T (I - X)^{-1}, \qquad \alpha_{\infty s} = (I - X)^{-1} \alpha_\infty$$

Given this representation we can evaluate another scoring function,

$$f_s(x) = E[|w|_x] \qquad (3)$$

that is, the expected number of times the string $x$ appears in a string $w$ as a substring, which is now given by the equation

$$f_s(x) = E[|w|_x] = \alpha_{1s}^T A_x \alpha_{\infty s} \qquad (4)$$

This is a critical transformation and one that we will make use of often in this work.

Transformation 2: A third representation, $W_p$, is obtained by applying the transformation used for (3) only to the final probability vector, so that the transformed WFA is defined by the tuple $W_p = \langle \alpha_1, \alpha_{\infty s}, \{A_\sigma\} \rangle$. This realizes another scoring function, which gives the score/probability of $x$ being a prefix in the sample space of strings $\Sigma^*$, i.e.

$$f_p(x) = P[x\Sigma^*] = \alpha_1^T A_x \alpha_{\infty s} \qquad (5)$$

We did not find any useful application of this transformation, and hence it is mentioned here only for completeness. The proofs of both transformations are discussed in detail in [5].

2.4. WFA Hankels

We now introduce another important building block in this discussion: Hankel matrices, from which WFAs can be recovered. The idea is to construct a large matrix $H_f \in \mathbb{R}^{P \times S}$ such that $H_f(p, s) = f_s(p \cdot s)$, where $p \in P$ and $s \in S$, with $P$ and $S$ being the sets of all possible prefixes and suffixes respectively. Defined this way, the Hankel matrix is of infinite size and hence impossible to work with directly. To circumvent this problem, a basis is defined by restricting the sets of prefixes and suffixes beforehand, which makes the Hankel matrix finite. The choice of basis depends on the problem at hand; the important part is that the values of this Hankel matrix correspond to the scores obtained from an underlying WFA.

Example: Assume we have a set of sample strings

$$X = \{aa, b, bab, a, b, a, ab, aa, ba, b, aa, a, aa, bab, b, aa\}$$

If we want to create a Hankel matrix that realizes substring expectations such as those given by (4), we can define a basis $P = \{\lambda, a, b, ba\}$ and $S = \{a, b\}$, where $\lambda$ denotes the empty string, and empirically fill in the Hankel matrix with these expectations such that

$$H_S(p, s) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}[x^i = x], \qquad x = p \cdot s \qquad (6)$$

giving the empirical Hankel of the form

$H_s$ = [matrix with rows indexed by the prefixes $\lambda$ (the empty string), $a$, $b$, $ba$ and columns indexed by the suffixes $a$, $b$]

In our case we generally define the Hankel matrix with equal sets of prefixes and suffixes as the basis, since this is easier to deal with and more intuitive.

2.5. Spectral Learning

Figure 2. The general flow sheet of the learning algorithm: the training data is used to create the empirical Hankel matrix, which is then factorized to recover the underlying WFA.

The spectral learning of a WFA from data can be divided into two parts:

Empirical Hankel

We now get to an integral part of the method, the spectral learning of the underlying WFA responsible for generating a set of sequences. Let $X$ be an available training set of $N$ strings, and assume the strings consist of the symbols $a$ and $b$, appearing in differing orders. The first step in learning the underlying WFA from this sample set is the creation of the empirical Hankel matrix. The critical property of this Hankel matrix, stated in Theorem 1 below, is that its rank gives the number of states in the WFA; the theorem of course holds for the theoretical case of the infinite matrix.

Theorem 1: [30][31]
1. If $f = f_A$ for some WFA $A$ with $n$ states, then $\mathrm{rank}(H_f) \le n$.
2. If $\mathrm{rank}(H_f) = n$, then there exists a WFA $A$ with $n$ states such that $f = f_A$.

This is an important theorem in the context of this work. However, working with infinite matrices is not possible in practice, and hence, as pointed out earlier, we need to define a basis in advance. The big Hankel matrix $H$ is a concatenation of empirically constructed Hankel matrices for the empty symbol and for each symbol of the alphabet, i.e. $H = [H_\lambda \; H_\sigma]$, where each of the sub-Hankels is in $\mathbb{R}^{P \times S}$, with $P$ and $S$ the numbers of prefix and suffix bases. Two more Hankel vectors are needed for the learning, which are

$$h_{P,\lambda} \in \mathbb{R}^{P \times 1}, \qquad h_{\lambda,S} \in \mathbb{R}^{1 \times S}$$

Moreover, each $H_\sigma$ is a sub-block of the big $H$, where $H_\sigma(p, s) = f_s(p \sigma s)$.

Example: Consider a set of sequences with 2 symbols $\{a, b\}$ and a basis of the form $P = S = \{\lambda, a, b, aa, ab, ba, bb\}$, where $\lambda$ is the empty string. The matrices and vectors discussed above then look like

$$H_\lambda = \begin{pmatrix}
f_s(\lambda) & f_s(a) & f_s(b) & f_s(aa) & f_s(ab) & f_s(ba) & f_s(bb) \\
f_s(a) & f_s(aa) & f_s(ab) & f_s(aaa) & f_s(aab) & f_s(aba) & f_s(abb) \\
f_s(b) & f_s(ba) & f_s(bb) & f_s(baa) & f_s(bab) & f_s(bba) & f_s(bbb) \\
f_s(aa) & f_s(aaa) & f_s(aab) & f_s(aaaa) & f_s(aaab) & f_s(aaba) & f_s(aabb) \\
f_s(ab) & f_s(aba) & f_s(abb) & f_s(abaa) & f_s(abab) & f_s(abba) & f_s(abbb) \\
f_s(ba) & f_s(baa) & f_s(bab) & f_s(baaa) & f_s(baab) & f_s(baba) & f_s(babb) \\
f_s(bb) & f_s(bba) & f_s(bbb) & f_s(bbaa) & f_s(bbab) & f_s(bbba) & f_s(bbbb)
\end{pmatrix}$$

$$H_a = \begin{pmatrix}
f_s(a) & f_s(aa) & f_s(ab) & f_s(aaa) & f_s(aab) & f_s(aba) & f_s(abb) \\
f_s(aa) & f_s(aaa) & f_s(aab) & f_s(aaaa) & f_s(aaab) & f_s(aaba) & f_s(aabb) \\
f_s(ba) & f_s(baa) & f_s(bab) & f_s(baaa) & f_s(baab) & f_s(baba) & f_s(babb) \\
f_s(aaa) & f_s(aaaa) & f_s(aaab) & f_s(aaaaa) & f_s(aaaab) & f_s(aaaba) & f_s(aaabb) \\
f_s(aba) & f_s(abaa) & f_s(abab) & f_s(abaaa) & f_s(abaab) & f_s(ababa) & f_s(ababb) \\
f_s(baa) & f_s(baaa) & f_s(baab) & f_s(baaaa) & f_s(baaab) & f_s(baaba) & f_s(baabb) \\
f_s(bba) & f_s(bbaa) & f_s(bbab) & f_s(bbaaa) & f_s(bbaab) & f_s(bbaba) & f_s(bbabb)
\end{pmatrix}$$

$$H_b = \begin{pmatrix}
f_s(b) & f_s(ba) & f_s(bb) & f_s(baa) & f_s(bab) & f_s(bba) & f_s(bbb) \\
f_s(ab) & f_s(aba) & f_s(abb) & f_s(abaa) & f_s(abab) & f_s(abba) & f_s(abbb) \\
f_s(bb) & f_s(bba) & f_s(bbb) & f_s(bbaa) & f_s(bbab) & f_s(bbba) & f_s(bbbb) \\
f_s(aab) & f_s(aaba) & f_s(aabb) & f_s(aabaa) & f_s(aabab) & f_s(aabba) & f_s(aabbb) \\
f_s(abb) & f_s(abba) & f_s(abbb) & f_s(abbaa) & f_s(abbab) & f_s(abbba) & f_s(abbbb) \\
f_s(bab) & f_s(baba) & f_s(babb) & f_s(babaa) & f_s(babab) & f_s(babba) & f_s(babbb) \\
f_s(bbb) & f_s(bbba) & f_s(bbbb) & f_s(bbbaa) & f_s(bbbab) & f_s(bbbba) & f_s(bbbbb)
\end{pmatrix}$$

$$h_{P,\lambda} = h_{\lambda,S}^T = \begin{pmatrix} f_s(\lambda) \\ f_s(a) \\ f_s(b) \\ f_s(aa) \\ f_s(ab) \\ f_s(ba) \\ f_s(bb) \end{pmatrix}$$

Recovering WFA

Once the above Hankel matrices have been estimated, the recovery part is straightforward and involves taking an SVD and doing some matrix multiplications and inversions. The step-by-step algorithm is as follows:

1. Start from the Hankel matrices $H_\lambda$ and $H_\sigma$.
2. Take a reduced SVD $H_\lambda = U D V^T$, truncated to the desired number of states $n$.
3. Let $X = U D$ and $Y = V^T$.
4. Then $\alpha_{1s}^T = h_{\lambda,S} Y^{+}$, $A_\sigma = X^{+} H_\sigma Y^{+}$, and $\alpha_{\infty s} = X^{+} h_{P,\lambda}$, where $^{+}$ denotes the (pseudo-)inverse.

We have found substring counting to be more intuitive, and hence the empirical Hankel matrices are created by substring expectation calculations. That is why the recovered WFA is defined by the tuple $W_s = \langle \alpha_{1s}, \alpha_{\infty s}, \{A_\sigma\} \rangle$, which can be transformed into $W = \langle \alpha_1, \alpha_\infty, \{A_\sigma\} \rangle$ using the transformations discussed in section 2.3.
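The recovery steps can be sketched in a few lines, assuming NumPy and reading the "inversions" as Moore-Penrose pseudo-inverses (an assumption, since the factors are not square). To keep the example checkable, the Hankel blocks are filled exactly from a small made-up WFA rather than from counts, and `f` stands for whichever evaluation function populates the Hankel. All function and variable names here are hypothetical.

```python
import itertools
import numpy as np


def score(alpha1, alpha_inf, A, x):
    """Evaluate alpha1^T A_x alpha_inf for the string x."""
    v = np.asarray(alpha1, dtype=float)
    for symbol in x:
        v = v @ A[symbol]
    return float(v @ alpha_inf)


def hankel_blocks(f, basis, alphabet):
    """Fill the Hankel blocks on one finite basis used for prefixes and suffixes."""
    H = np.array([[f(p + s) for s in basis] for p in basis])
    H_sigma = {c: np.array([[f(p + c + s) for s in basis] for p in basis])
               for c in alphabet}
    h_P = np.array([f(p) for p in basis])   # column vector h_{P,lambda}
    h_S = np.array([f(s) for s in basis])   # row vector h_{lambda,S}
    return H, H_sigma, h_P, h_S


def spectral_recover(H, H_sigma, h_P, h_S, n):
    """Recover an n-state WFA from the Hankel blocks via a truncated SVD."""
    U, d, Vt = np.linalg.svd(H, full_matrices=False)
    X = U[:, :n] * d[:n]             # X = U D
    Y = Vt[:n, :]                    # Y = V^T
    X_pinv = np.linalg.pinv(X)
    Y_pinv = np.linalg.pinv(Y)
    alpha1 = h_S @ Y_pinv
    alpha_inf = X_pinv @ h_P
    A = {c: X_pinv @ Hc @ Y_pinv for c, Hc in H_sigma.items()}
    return alpha1, alpha_inf, A


# Hypothetical ground-truth WFA (made-up weights) used to fill exact Hankel blocks.
true_alpha1 = np.array([1.0, 0.0])
true_alpha_inf = np.array([0.2, 0.3])
true_A = {"a": np.array([[0.2, 0.3], [0.1, 0.1]]),
          "b": np.array([[0.1, 0.2], [0.3, 0.2]])}

basis = ["", "a", "b"]               # "" plays the role of the empty string
blocks = hankel_blocks(lambda x: score(true_alpha1, true_alpha_inf, true_A, x),
                       basis, "ab")
est_alpha1, est_alpha_inf, est_A = spectral_recover(*blocks, n=2)

# Largest score disagreement over all strings of length <= 4.
max_gap = max(
    abs(score(true_alpha1, true_alpha_inf, true_A, "".join(w))
        - score(est_alpha1, est_alpha_inf, est_A, "".join(w)))
    for k in range(5) for w in itertools.product("ab", repeat=k)
)
```

The recovered matrices generally differ entry by entry from the ground truth, because the factorization is only unique up to a change of basis; what should agree is the score assigned to each string, which is what `max_gap` measures. With Hankel blocks filled from substring expectations instead, the same steps would return the transformed WFA $W_s$.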

Chapter 3
Pre-Processing

3.1. Introduction

The WFAs, as implemented here, deal primarily with strings, while in our target tasks we are dealing with videos. Our data therefore needs to be pre-processed before it can be used to train the WFAs. The intention is to convert the available data into a set of representative symbol sequences. Several fairly simple ways are explored to this end, most of them intended to exploit dynamical rather than appearance-based information. This is also one of the reasons why we deal primarily with videos that have skeleton joint information available.

3.2. Posebits

One of the first features we started off with is Posebits, as introduced in [32]. Posebits are a mid-level representation based on Boolean relationships between body parts, for example, whether the left arm is in front of the right arm. More examples are shown in Figure 3. The idea is to infer them directly from image features using a trained classifier. They are compositional by nature and hence much more flexible than plain action class labels. The dataset made available by [32] is known as the Posebit Dataset (PbDb) and is made up of data collected from 4 different datasets, some with available MoCap data and others consisting of 2D images. From MoCap data there are 10,000 poses taken from Human-Eva [33] and HMODB [34], while for 2D images they use Fashion [35] and Parse [36].

Figure 3. Examples of posebits, and some poses conditioned on different posebits [32].

Out of these we make use of the Human-Eva dataset information, since it corresponds most closely to the task we are initially looking at, that is, action classification.

Posebit Selection

Figure 4.1. Different relationships between body parts that posebits intend to exploit: joint distances, relative positions, articulation angles.

Figure 4.2. Posebit Binary Tree: the poses in each leaf node, constrained by all posebits in a posebyte.

Not all posebits are created equal. [32] argue that it is important to select posebits based on the tasks you intend to perform with them. To this end they propose a simple selection mechanism, inspired by decision trees, to choose a subset of posebits from the available ones based on the task at hand. For example, for 3D pose estimation (and activity recognition) the aim is to choose a subset of posebits using the following two criteria:

- How reliably they can be inferred from image features, $r$
- How helpful they can be in reducing uncertainty in the hidden variable, $x$

To select a subset $S_m$ from the available posebit candidate pool $S_c$, [32] use a greedy forward selection mechanism, that is, one bit at a time, with the posebit at each step $j$ selected to maximize information gain:

$$a_j^* = \arg\max_{a_s} I_j^M, \qquad I_j^M = I_j^C + \lambda \cdot I_j^R \qquad (7)$$

where

- $I_j^M$ is the mixed information gain at the $j$-th level of the tree
- $I^C$ is the clustering term
- $I^R$ is the reliability term
- $\lambda$ balances the two terms

The clustering information gain is further defined in terms of entropies as:

$$I_j^C = H_{j-1} - H_j \qquad (8)$$

where $H_j$ is the sum of entropies weighted at each node of the $j$-th level of the tree. It can be defined as:

$$H_j = \sum_{c=1}^{2^j} \frac{|X_{S_c}|}{|X_S|} H(X_{S_c}) \qquad (9)$$

$X_{S_c}$ being the subset of poses $x_p$ in class $c$, $X_S$ being the bigger set of MoCap poses, while the last term $H(X_{S_c})$ is the differential entropy. The reliability measure is defined as

$$Q(X_r, m) = \sum_{a \in A^m} p(x \mid a)\, p(a \mid r) \qquad (10)$$

$p(x \mid a)$ and $p(a \mid r)$ being the conditional pose and posterior posebyte distributions respectively. For posebit classification a structural SVM model is used:

$$\hat{a}_j = \arg\max_{a_j \in A_j} F(r, a_j, w_j), \qquad F(r, a_j, w_j) = w_j^T \Psi(a_j, r) \qquad (11)$$

where $\Psi(a_j, r)$ is the joint feature map of input $r$ and output $a_j$. The experiments we did combining posebits with WFAs are discussed in the experiments section.

3.3. Clusters of Velocities and Accelerations

The second approach uses the skeleton joint information available with datasets such as the Berkeley Multimodal Human Action Database (MHAD) [37], the Composable Activities Dataset [38], and the UT Kinect Dataset [39], or extracted from larger datasets such as J-HMDB [40].

Given the skeleton joint positions, for example in 3D, $(x, y, z)$, we first center the skeleton around one of the joints (usually the hip joint). Afterwards the mean over all the frames is removed to center the skeleton at the origin. A combination of these three simple techniques is then utilized:

1. Sub-Sampling: In most cases the joints do not move much from one frame to the next, and using all frames results in redundantly long sequences that carry little additional information. For this reason, instead of using each frame, an average skeleton is taken over $K$ frames at a time:

$$F_{subsampled} = \frac{1}{K} \sum_{j=1}^{K} F_j \qquad (12)$$

2. Velocities: Once these subsampled skeletons have been obtained, their velocities are taken to account for first-order motion:

$$v_j = F_{subsampled}^{j} - F_{subsampled}^{j-1} \qquad (13)$$

3. Acceleration: The acceleration is represented by taking differences of these velocities:

$$a_j = v_j - v_{j-1} \qquad (14)$$

Once these have been obtained, different combinations of them are utilized, and the next step is to run K-means clustering on them, with the number of clusters $C$ serving as the number of characters in the alphabet of the WFA. Matlab's inbuilt Kmeans++ algorithm [41] is used for this purpose.

3.4. Hankel Matrices

This is a more complicated approach than the ones discussed above, but it can potentially encode much more dynamical information. From control systems we know that a dynamic system can be defined by the following set of equations:

$$x_k = A x_{k-1}, \qquad y_k = C x_k + w_k \qquad (15)$$

Dynamical systems play a pivotal role in recognition systems that emphasize dynamics over appearance, including systems for gait recognition, dynamic texture recognition, and activity recognition. The basic idea is gleaned from system identification methods, that is, the identification of the $A$ and $C$ matrices in eq. 15 from training data.
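Returning briefly to the sub-sampling, velocity, and acceleration features of equations (12) to (14), the steps can be sketched in plain Python as below. This is a minimal illustration, not the thesis code: the function names are hypothetical, each frame is assumed to be a flat list of already centered joint coordinates, and a trailing group of fewer than $K$ frames is simply dropped (an assumption; the text does not say how leftovers are handled).

```python
def subsample(frames, K):
    """Average every K consecutive frames; a trailing group shorter than K is dropped."""
    averaged = []
    for start in range(0, len(frames) - K + 1, K):
        group = frames[start:start + K]
        averaged.append([sum(values) / K for values in zip(*group)])
    return averaged


def differences(sequence):
    """Frame-to-frame differences: out[j] = sequence[j + 1] - sequence[j]."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(sequence, sequence[1:])]


# Toy input: 6 frames of a single 1-D "joint" coordinate (made-up numbers).
frames = [[0.0], [2.0], [4.0], [8.0], [10.0], [20.0]]
skeletons = subsample(frames, K=2)        # averaged skeletons
velocities = differences(skeletons)       # first-order motion
accelerations = differences(velocities)   # second-order motion
```

The resulting velocity and acceleration vectors (or a concatenation of them) would then be passed to K-means, and the cluster index of each vector becomes one symbol of the string.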

However, most of the time, and especially in computer vision, the identification of these matrices is not an easy task, as the matrices are not unique and trying to recover them can lead to non-convex problems. To work around this, [1] introduced the use of the special structure of Hankel matrices [42]. It is important to mention that these Hankel matrices are different from the ones discussed in Chapter 2. To understand Hankelets (tracklets of Hankels), consider a tracklet from a video sequence with measurements $t_k$. The underlying dynamic sequence behind this tracklet can be modelled by a linear regressor [43]:

$$t_k = \sum_{i=1}^{n} a_i t_{k-i} \qquad (16)$$

This regressor can be represented as a Hankel matrix $H_D$ (the subscript $D$ is added to differentiate it from the Hankel matrices previously discussed), such that, in the absence of noise, $\mathrm{rank}(H_D)$ equals the order of the system:

$$H_D = \begin{pmatrix} t_1 & t_2 & \cdots & t_s \\ t_2 & t_3 & \cdots & t_{s+1} \\ \vdots & \vdots & \ddots & \vdots \\ t_r & t_{r+1} & \cdots & t_{r+s-1} \end{pmatrix} \qquad (17)$$

Figure 5. The line represents a trajectory, with colored points representing observations; the matrix on the right shows how to create a Hankel matrix from these observations.

The important argument by [42] in favour of this Hankel matrix is that it captures the underlying dynamics of the system irrespective of the initial conditions. In other words, two Hankelets from two trajectories output by the same underlying system will span the same linear subspace. They show this by factoring $H_D$ into $\Gamma X$, where $\Gamma$ is the observability matrix and $X$ is the state matrix, that is

$$\Gamma = \begin{pmatrix} C \\ CA \\ \vdots \\ CA^m \end{pmatrix}, \qquad X = \begin{pmatrix} x_0 & x_1 & \cdots & x_m \end{pmatrix} \qquad (18)$$

These Hankel matrices can be formed from trajectories or from any other features, such as joint information. For our purposes we follow the lead of [2] and use Gram matrices of Hankel matrices that encode the joint positions, that is, each observation $t_i$ encodes the 3D locations of the joints in frame $i$ (the superscript indexes the joint):

$$t_i = [x_i^{(1)}, y_i^{(1)}, z_i^{(1)}, x_i^{(2)}, y_i^{(2)}, z_i^{(2)}, \ldots]^T \qquad (19)$$

Given the Hankel matrix defined as in (17), the corresponding Gram matrix is given by

$$\hat{G} = \frac{H_D H_D^T}{\| H_D H_D^T \|_F} \qquad (20)$$

Clustering Gram Matrices

The next step in the conversion to the alphabet required by the WFA is the clustering of the Gram matrices defined by (20). For clustering, a distance-like metric needs to be defined to find the cluster centers. Since these matrices live on the Positive Semi-Definite (PSD) manifold, [2] mentions a number of metrics that can be used for the purpose. The first is the Affine Invariant Riemannian Metric (AIRM) [44], defined for two Gram matrices $X, Y$ as

$$d_R(X, Y) = \| \log(X^{-1/2} Y X^{-1/2}) \|_F \qquad (21)$$

The second is the Log-Euclidean Riemannian Metric (LERM) [45]:

$$d_{le}(X, Y) = \| \log(X) - \log(Y) \|_F \qquad (22)$$

Another metric that they mention and argue in favour of, and which is hence used here, is the Jensen-Bregman LogDet Divergence (JBLD) [46], defined as

$$d_J(X, Y) = \log \left| \frac{X + Y}{2} \right| - \frac{1}{2} \log |XY| \qquad (23)$$

The JBLD defined in eq. 23 is what we use here for clustering, with the mean (or the center of a cluster) defined as

$$X^* = \arg\min_X \sum_{i=1}^{N} d_J(X, X_i) \qquad (24)$$

So, in summary, given a set of sequences of different activities, we chop the sequences into smaller overlapping sequences, encode them into Gram matrices, and cluster them using JBLD. The cluster labels then serve as the alphabet for training WFAs.
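The encoding pipeline summarized above (Hankel matrix, normalized Gram matrix, JBLD) can be sketched as follows, assuming NumPy. The function names are hypothetical, and the small ridge `eps` added before taking log-determinants is an assumption made here so that rank-deficient Gram matrices still give finite values; it is not stated in the text. The clustering step itself is not shown.

```python
import numpy as np


def block_hankel(frames, rows):
    """Stack observations t_1..t_T (one vector per frame) into a block Hankel
    matrix with `rows` block rows, so block (r, c) holds t_{r+c}."""
    frames = np.asarray(frames, dtype=float)       # shape (T, d)
    cols = frames.shape[0] - rows + 1
    return np.vstack([frames[r:r + cols].T for r in range(rows)])


def normalized_gram(H):
    """Gram matrix H H^T scaled to unit Frobenius norm."""
    G = H @ H.T
    return G / np.linalg.norm(G, "fro")


def jbld(X, Y, eps=1e-6):
    """Jensen-Bregman LogDet divergence; eps * I is an assumed regularizer."""
    ridge = eps * np.eye(X.shape[0])
    X = X + ridge
    Y = Y + ridge
    _, logdet_mid = np.linalg.slogdet((X + Y) / 2.0)
    _, logdet_x = np.linalg.slogdet(X)
    _, logdet_y = np.linalg.slogdet(Y)
    return logdet_mid - 0.5 * (logdet_x + logdet_y)   # log|XY| = log|X| + log|Y|


# Two toy 1-D trajectories (made-up numbers) with different dynamics.
ramp = [[1.0], [2.0], [3.0], [4.0]]
flip = [[1.0], [-1.0], [1.0], [-1.0]]
G1 = normalized_gram(block_hankel(ramp, rows=2))
G2 = normalized_gram(block_hankel(flip, rows=2))
distance = jbld(G1, G2)
```

For skeleton data each frame vector would hold all joint coordinates of that frame, and the Gram matrices of overlapping windows would be fed to a clustering routine that uses `jbld` as its distance.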

Chapter 4
Synthetic Experiments

Before moving on to the target tasks in computer vision, we felt it important to establish our implementation of the Weighted Finite Automatas and their discriminative capabilities on firmer ground by performing a variety of synthetic experiments, the details of which are discussed here with examples. This chapter also covers the implementation details of the WFA part of the work. Topics covered include:

1. WFA Generation
2. String Generation from WFAs
3. Evaluation Functions, Hankel Construction, and Spectral Learning
4. Experiments to Evaluate Estimated WFAs

4.1. WFA Generation

The first step in performing synthetic experiments was establishing a ground truth, which means having the ability to create WFAs of our own with different specifications. This was handled using the knowledge that a WFA $W$ is defined by the tuple $W = \langle \alpha_1, \alpha_\infty, \{A_\sigma\} \rangle$, where

- $\alpha_1 \in \mathbb{R}^N$ is the initial state probability vector
- $\alpha_\infty \in \mathbb{R}^N$ is the termination probability vector
- $A_\sigma \in \mathbb{R}^{N \times N}$ holds the transition probabilities for the symbol $\sigma$

and the knowledge that we can create a probabilistic WFA by following the rules below:

$$\sum_{i=1}^{N} \alpha_{1i} = 1, \qquad \sum_{\sigma} \sum_{j=1}^{N} \{A_\sigma\}_{ij} + \alpha_{\infty i} = 1, \quad i = 1 \ldots N \qquad (25)$$

In addition to this, we also provide a sparsity option to control how dense or how sparse (in terms of connections between states) we want our WFA to be.

Example: The code takes as input the number of states $N$ and the number of characters of the alphabet $S$. For example, with $N = 6$ and $S = 3$ ($\{a, b, c\}$), the following WFA was generated with 80% density:

$\alpha_1^T = [\ldots]$, $\alpha_\infty^T = [\ldots]$, $A_a = [\ldots]$

$A_b = [\ldots]$, $A_c = [\ldots]$

4.2. String Generation from WFAs

Now that we have created a WFA, it can be used to generate random sequences by traversing its states. These strings can be used for training as well as testing. The following steps are followed to generate a string from a WFA $W = \langle \alpha_1, \alpha_\infty, \{A_\sigma\} \rangle$:

- Select the initial state by sampling the states according to the initial probability distribution defined by $\alpha_1$
- Select the next state, as well as the symbol to be emitted, according to the probability distribution defined by the rows of $\{A_\sigma\}$ and the termination vector $\alpha_\infty$
- No length limit is imposed; the generation stops once the termination state is reached

Example: Based on the WFA from section 4.1, one of the generated strings looks like this:

x = bcbcabbaabcbaacbcbccbbcaabbbbbaacbacca

For training purposes in the synthetic experiments we generate 10,000 strings from each WFA, with an average length of around 12 characters. The longest string generated was

... characters, the shortest being an empty string. This shows that WFAs cover a wide range of sequences in terms of length.

4.3. Evaluation Functions, Empirical Hankels, and Spectral Learning

These have already been discussed in Chapter 2; for completeness and narrative purposes we mention them briefly here.

Evaluation Functions

Given the WFA $W = \langle \alpha_1, \alpha_\infty, \{A_\sigma\} \rangle$, its transformation $W_s = \langle \alpha_{1s}, \alpha_{\infty s}, \{A_\sigma\} \rangle$, and a string or substring $x$, the following two evaluation functions are implemented by simple multiplication:

i) $f(x) = \alpha_1^T A_x \alpha_\infty$
ii) $f_s(x) = \alpha_{1s}^T A_x \alpha_{\infty s}$

In practice the scores are kept positive and are also normalized so that the length of the sequences has less effect on the scores.

Empirical Hankels

The structure of these Hankel matrices, the number of Hankel matrices calculated, and their entries are exactly as described in section 2.5. In terms of implementation, for the running example of this chapter with 3 characters, we select the basis $P = S = \{$all combinations of $a, b, c$ up to length 4$\}$, which results in a basis of length 121. The sizes of the Hankel matrices and vectors mentioned previously are then:

$$H_\lambda, H_a, H_b, H_c \in \mathbb{R}^{121 \times 121}, \qquad h_{P,\lambda} \in \mathbb{R}^{121 \times 1}, \qquad h_{\lambda,S} \in \mathbb{R}^{1 \times 121}$$

The number of basis elements, based on the number of characters $S$ and the maximum length of the basis $l$, is given by the following formula:

$|\text{basis}| = \frac{S^{l+1} - 1}{S - 1}$    (26)

Moreover, while selecting the basis we count frequencies, and the order of the basis elements depends on their frequency as substrings in the training strings. Also, since the ground truth WFA is available, we can directly fill in the entries of the Hankel matrices to create what we call the Theoretical Hankel. If our empirically estimated Hankels are close to these Theoretical Hankels, our estimation corresponds well to the theory. Recall that the entries are filled in by the evaluation function given by:

$f_s(x) = E[|w|_x] = \sum_w |w|_x P(w)$

Empirically, this means calculating the expected value of each combination of basis elements in the training set.

Spectral Learning

Once we have the empirical Hankel matrices, we can proceed with learning the underlying WFA. Since the Hankels are constructed using expected values, the WFA recovered using the methods outlined in Section 2.5 is the transformed one, $\hat{W}_s = \langle \alpha_1^s, \alpha_\infty^s, \{A_\sigma\} \rangle$, which can of course be transformed to correspond to $W$. An important thing to remember, however, is that the spectral technique does not guarantee the recovery of a unique WFA. In terms of matrix entries, the estimate $\hat{W}$ and the ground truth $W$ can, and most of the time will, be very different; what we are more interested in is their behavior, which is evaluated using the experiments outlined in the next section.

Experiments to Evaluate Estimated WFAs

To establish whether or not the estimated WFAs are a good approximation of the ground truth, we performed the following experiments.

Frobenius Norm

The Frobenius norm is generally a good metric for comparing matrices. We posit that even before the spectral learning method kicks in it is important to establish the closeness of the empirically

calculated Hankel matrices to the Theoretical Hankels created using the ground truth WFA. If $eH$ is the estimated Hankel and $H_{gt}$ is the theoretical Hankel, the normalized Frobenius norm distance can be calculated as:

$d(eH, H_{gt}) = \frac{\| eH - H_{gt} \|_F}{\| H_{gt} \|_F}$    (27)

We ran exhaustive experiments, creating WFAs with different numbers of states and alphabet sizes, and in all cases found that the distance in (27) never went above 2%. Moreover, when the same distance was calculated against Hankels from different WFAs, it was found to be larger than the distance of the estimated Hankel from the ground truth. For example, with an estimation using 10,000 strings, the Frobenius norm distance (27) to the ground truth Hankels was around 1%, whereas when the ground truth was compared to Hankels constructed from false WFAs the distance was much larger (on average more than 10%). Very similar behaviour was observed when the same calculations were done using subspace angles and JBLD instead of the Frobenius norm. All of this establishes the validity of our counting process and of the empirically created Hankels.

Perplexity

This is a fairly popular metric in Natural Language Processing, and is also suggested by [5]. The idea is to evaluate a number of test strings, treating them as an ensemble and normalizing the resulting scores to sum to 1, thus forming a probability distribution over the ensemble. If the estimated WFAs are good enough, this probability distribution should be close to the distribution obtained when the scores are calculated using the ground truth WFAs. Perplexity is a measure used to compare probability distributions. Given a probability distribution $p(x)$ and its estimate $q(x)$, the perplexity $P(p, q)$ is defined as:

$P(p, q) = 2^{-\sum_x p(x) \log q(x)}$    (28)
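As a rough illustration of (28), the sketch below normalizes two sets of scores over a small ensemble and computes the perplexity of one against the other. The function names, the toy score values and the choice of base-2 logarithms are assumptions made for this example; they are not taken from the experiments. It also assumes every q(x) is strictly positive.

```python
import math

def normalize(scores):
    """Turn non-negative scores over an ensemble into a distribution summing to 1."""
    total = sum(scores)
    return [s / total for s in scores]

def perplexity(p, q):
    """P(p, q) = 2 ** (-sum_x p(x) * log2 q(x)); base-2 logs are an assumption."""
    return 2.0 ** (-sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0))

# Toy scores for a 4-string ensemble (hypothetical values, for illustration only).
p = normalize([4.0, 3.0, 2.0, 1.0])   # stands in for ground truth WFA scores
q = normalize([3.5, 3.0, 2.5, 1.0])   # stands in for estimated WFA scores
uniform = [0.25] * 4                  # an uninformed estimate for comparison

print(perplexity(p, q))        # estimate close to p
print(perplexity(p, uniform))  # uninformed estimate, larger value
```

In this toy case the estimate close to p yields a smaller perplexity than the uniform one, which is the behaviour the metric is used for here.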

A lower perplexity means a closer approximation, but it is important to remember that there is a lower bound as well; a perplexity below the lower bound also indicates a poorer approximation. This lower bound is given by the self-perplexity of $p$:

$L(p) = 2^{-\sum_x p(x) \log p(x)}$    (29)

For example, in our case we have a test ensemble consisting of 1000 strings, i.e. $\{x_i\}_{i=1}^{1000}$, over which the probability distributions are calculated. The following are the perplexity and the lower bound:

$P(p, q) = 2^{-\sum_{i=1}^{1000} p(x_i) \log q(x_i)}$

$L(p) = 2^{-\sum_{i=1}^{1000} p(x_i) \log p(x_i)}$

As can be seen, the perplexity obtained from the estimated WFA versus the ground truth WFA is very close to the lower bound, indicating that the estimate $q$ is fairly close to $p$. To drive home the point, the same calculations were done with false WFAs versus the ground truth WFA; the results are tabulated below.

Table 1. The perplexity values for false WFAs versus the ground truth WFA. They are either above the perplexity obtained with our estimate or well below the lower bound, indicating in both cases that our estimate is performing well.

N    Perplexity

KL Divergence

This is very closely related to perplexity, since both model entropy, with the exception that the lower bound for KLD is zero, as it is defined as follows:

$KLD(p, q) = \sum_x p(x) \ln \frac{p(x)}{q(x)}$    (30)

Just like perplexity, a lower KLD indicates a closer estimate. The value we obtain with our estimate is,

$KLD(p, q) = \sum_{i=1}^{1000} p(x_i) \ln \frac{p(x_i)}{q(x_i)}$

Table 2. Again, the KLD values here are higher than the KLD value obtained by our estimate, indicating that the estimate is close to the actual WFA.

N    KLD

Word Prediction Error Rate

As mentioned in [5], given a prefix, WFAs are able to predict the next symbol. This gives us another way to see how closely our estimated WFA performs relative to the ground truth WFA. We define the Word Prediction Error Rate (WPER) as the number of times the prediction of the ground truth WFA $W$ differs from the prediction of the estimated $\hat{W}$, divided by the total number of symbols predicted.

The predictions are done the same way as suggested by [5]; that is, given a WFA $W$ and its transform $W_s$, if a prefix $w_{1,i}$ is provided, do the following:

- Compute the scores $\alpha_i^T A_\sigma \alpha_\infty^s$ for all possible symbols $\sigma$.
- Compute the score for end of sequence, $\alpha_i^T \alpha_\infty$.
- $\alpha_i^T$ is calculated iteratively as $\alpha_i^T = \alpha_{i-1}^T A_{w_i}$.
- The symbol (or end of string) that gives the highest score is predicted.

In our case the WPER remains on average around 5%; that is, for 100 predicted symbols there are only 5 errors.
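The prediction steps and the WPER definition above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the dictionary layout, the function names and the two toy 2-state automata are hypothetical, the forward vector is assumed to start from $\alpha_1$, and $\alpha_\infty^s$ is taken to be the all-ones vector, which is what $(I - \sum_\sigma A_\sigma)^{-1} \alpha_\infty$ evaluates to for a probabilistic automaton.

```python
def vec_mat(v, M):
    """Row vector times matrix: returns v^T M as a list."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict_next(wfa, prefix):
    """Predict the next symbol after `prefix`; None stands for end of string."""
    fwd = wfa["alpha1"]
    for sym in prefix:                       # alpha_i^T = alpha_{i-1}^T A_{w_i}
        fwd = vec_mat(fwd, wfa["A"][sym])
    scores = {None: dot(fwd, wfa["alpha_inf"])}          # end-of-sequence score
    for sym, A in wfa["A"].items():                      # one score per symbol
        scores[sym] = dot(vec_mat(fwd, A), wfa["alpha_inf_s"])
    return max(scores, key=scores.get)

def wper(truth, estimate, prefixes):
    """Fraction of prefixes on which the two WFAs predict different symbols."""
    errors = sum(predict_next(truth, w) != predict_next(estimate, w) for w in prefixes)
    return errors / len(prefixes)

# Hypothetical 2-state automata: state 0 emits 'a', state 1 emits 'b' or stops.
truth = {"alpha1": [1.0, 0.0], "alpha_inf": [0.1, 0.3], "alpha_inf_s": [1.0, 1.0],
         "A": {"a": [[0.0, 0.9], [0.0, 0.0]], "b": [[0.0, 0.0], [0.7, 0.0]]}}
estimate = {"alpha1": [1.0, 0.0], "alpha_inf": [0.1, 0.6], "alpha_inf_s": [1.0, 1.0],
            "A": {"a": [[0.0, 0.9], [0.0, 0.0]], "b": [[0.0, 0.0], [0.4, 0.0]]}}

prefixes = ["", "a", "ab", "aba"]
print(wper(truth, estimate, prefixes))
```

In the toy example the two automata agree after the prefixes "" and "ab" but disagree after "a" and "aba", where the estimate prefers to stop while the ground truth continues with 'b'.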

Chapter 5

Experiments

Having established the grounds for the use of WFAs with experiments on synthetic data, we proceed to apply our work to different datasets in Computer Vision, specifically in the Activity Recognition scenario. What follows in this chapter is an explanation of the datasets used, followed by the experiments performed while varying different parameters, and the results obtained on each dataset. Some of the initial experiments, for example those done on the Posebit Database, were done as a proof of concept and are thus not as detailed as those done later with other datasets.

Experimental Setup

As described earlier, there are different parameters to play with in the method, including the method for preparing the alphabet needed to train the WFAs, the number of states of the WFAs, the size of the basis and the number of symbols. Since WFAs in general require considerably more data to train than is available in these datasets, we use a leave-one-out strategy for evaluation. The idea is to train one WFA per action, leaving out one sequence for testing, and then to check the scores assigned by all the WFAs to that sequence: if the maximum score corresponds to the ground truth label, it is a correct recognition; if not, it is counted as an error.

Instead of controlling the number of states of individual WFAs (which would leave us with too many parameters to tune), we control the number of states en masse by using the percentage rule:

n = the number where the eigenvalue of H s% of the highest eigenvalue of H
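The leave-one-out evaluation described in this section can be sketched as follows. To keep the example self-contained, an add-one smoothed unigram model stands in for the per-action WFA; it is only a placeholder and not a WFA. The names `train_model`, `score` and `leave_one_out_accuracy`, as well as the toy sequences, are hypothetical.

```python
import math
from collections import Counter

def train_model(sequences, alphabet):
    """Stand-in for WFA training: an add-one smoothed unigram model (NOT a WFA)."""
    counts = Counter(sym for seq in sequences for sym in seq)
    total = sum(counts.values()) + len(alphabet)
    return {sym: (counts[sym] + 1) / total for sym in alphabet}

def score(model, seq):
    """Length-normalized log-score of a sequence under the stand-in model."""
    return sum(math.log(model[sym]) for sym in seq) / max(len(seq), 1)

def leave_one_out_accuracy(data, alphabet):
    """data: list of (label, sequence). Hold out each sequence in turn."""
    correct = 0
    for i, (true_label, test_seq) in enumerate(data):
        train = data[:i] + data[i + 1:]
        # One model per action, trained without the held-out sequence.
        models = {
            label: train_model([s for lab, s in train if lab == label], alphabet)
            for label in sorted({lab for lab, _ in train})
        }
        # The action whose model assigns the highest score is the prediction.
        predicted = max(models, key=lambda lab: score(models[lab], test_seq))
        correct += predicted == true_label
    return correct / len(data)

# Toy symbol sequences for two actions (hypothetical data).
data = [("wave", "ababab"), ("wave", "abab"), ("wave", "abba"),
        ("clap", "cdcd"), ("clap", "ccdd"), ("clap", "dcdc")]
print(leave_one_out_accuracy(data, "abcd"))
```

Swapping the stand-in for a real per-action model only requires replacing `train_model` and `score`; the hold-out loop and the maximum-score decision stay the same.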

Thus, varying s allows us to vary the number of states of the WFAs without tuning them individually. Other parameters include the number of symbols (or clusters) C, and the overlap window used when considering velocities, accelerations, etc. During the course of our experiments we found that C = 10 seems to give good results: having too few symbols leads to more monotonous sequences, while having too many symbols can slow down the process without yielding any significant improvement.

MHAD Dataset

Description

The Multimodal Human Action Database (MHAD) [37] is one of the most widely used action recognition datasets with 3D joint information in the Computer Vision community. The dataset consists of 11 actions performed by 12 actors, and each actor performs each action 5 times. One sequence is missing, so the total number of sequences is 659. The actions are as follows; the same numbering is followed in the confusion matrices shown in this section:

1. Jumping in place
2. Jumping Jacks
3. Bending hands up all the way down
4. Punching/boxing
5. Waving two hands
6. Waving One hand
7. Clapping hands
8. Throwing a ball
9. Sit down then stand up
10. Sit down
11. Stand up

The activities have varying levels of dynamics: some involve just the upper body, like waving, punching and clapping, while others involve the whole body. This dataset can therefore be considered a naturalistic dataset.

Figure 6.1. Snapshots from one of the actors performing the 11 actions in MHAD. We make use of the MoCap data, with 35 joints in 3 dimensions, leading to 105-point feature vectors.

Figure 6.2. Snapshots from the throwing action as captured by MoCap cameras from different angles.

Evaluation

For the MHAD Dataset, the results in Figure 7.1 are the best that we were able to achieve, using the parameters C = 10, s = 95%, basis length = 2.

Figure 7.1. Confusion matrix for activities in the MHAD Dataset, with an average accuracy of around 90%.

In contrast, if a very low s = 60% is used, the WFAs are not able to capture much of the dynamics of the sequences, and hence the performance drops drastically, as shown in Figure 7.2.

Figure 7.2. Modelling the activities using a lower number of states leads to a significant drop in performance, as shown here for the MHAD Dataset; the average accuracy has gone down to around 54%.

Similarly, increasing s to a higher percentage, which means allowing for a higher dependence on the number of states gleaned from the training data, can result in over-fitting, and hence once again a drop in performance is observed.

Figure 7.3. Confusion matrix showing a drop in average accuracy when the WFAs are allowed to fit too closely to the data (s = 99%).

Overall, the number of states plays a critical role in the performance of the system. We noticed a bell-shaped trend relative to the value of s: increasing s yielded an improvement in the average accuracy up to a certain value (generally close to 95%), and any further increase resulted in deteriorating accuracy. The MHAD Dataset is now a solved dataset, and our accuracy is not state of the art. Recently [2] demonstrated 100% accuracy and also listed the accuracies achieved by other methods; we reproduce their results here.

Table 3. Comparison of accuracies with existing methods shows there is room for improvement.

Method    Average Accuracy (%)
SMIJ [47]
RBF Net [48]
Dynemes [49]
Bio-LDS [50]

HBRNN-L [51]    100
G-L [2]
G-J [2]
G-A [2]
G-K [2]
WFA

MSR 3D Dataset

Description

The Microsoft Research 3D (MSR-3D) dataset [52] is another popular action recognition dataset that provides MoCap information. It consists of 20 actions of varying similarity and dynamics, from full body movements to partial body movements. Each action is performed by 10 subjects, leading to a total of 557 relatively short sequences. The actions are as follows; the same numbering is followed in the confusion matrices:

1. High Arm Wave
2. Horizontal Arm Wave
3. Hammer
4. Hand Catch
5. Forward Punch
6. High Throw
7. Draw Cross
8. Draw Tick
9. Draw Circle
10. Hand Clap
11. Two Hand Wave
12. Side Boxing
13. Bend
14. Forward Kick
15. Side Kick
16. Jogging

17. Tennis Swing
18. Tennis Serve
19. Golf Swing
20. Pickup & Throw

Evaluation

A similar set of experiments was performed, and once again the best accuracy was observed with s = 90%, C = 10, basis length = 2. The results are shown in Figure 8.1 in the form of a confusion matrix.

Figure 8.1. An average accuracy of almost 93% is achieved, with understandable confusion between the two different types of kicks (forward and side kicks).

For s = 95%, the accuracy goes down, indicating overfitting.

Figure 8.2. The average accuracy goes down when s is increased.

Similarly, on lowering s to 75%, the accuracy again suffers drastically, indicating that the WFAs have failed to model the dynamics.

Figure 8.3. The average accuracy again suffers when a smaller s is used.

5.4. Composable Activities Dataset

Description

The Composable Activities Dataset, introduced in [38], is very different from the datasets discussed so far, since it is made up of sequences of complex activities, which in turn are made up of sub-activities. All in all, there are 693 sequences of 16 classes performed by 14 actors. Each composable action is made up of different combinations of 3 to 11 sub-activities out of a total of 26 activities. The dataset exhibits high variance in the complexity as well as the similarity of the sequences, and as such is a difficult dataset for action recognition. The 16 action classes for classification are as follows:

1. Composable Activity 1
2. Composable Activity 2
3. Composable Activity 3
4. Composable Activity 4
5. Composable Activity 5
6. Composable Activity 6
7. Composable Activity 7
8. Composable Activity 8
9. Hand Wave and Drink
10. Talk Phone and Drink
11. Talk Phone and Pickup
12. Talk Phone and Scratch Head
13. Walk while Calling with Hands
14. Walk while Clapping
15. Walk while Hand Waving
16. Walk while Reading

The first 8 activities are composed of 3 to 11 sub-actions, most of the time performed sequentially, but sometimes performed in parallel. The sub-actions include reading, gesticulating, erasing/writing on a board, etc. The authors provide skeleton data and annotations.

Figure 9.1. A few examples of actions from the Composable Activities dataset. Some actions are performed in parallel, as in the top-left, where the subject walks while hand waving; others are sequential, as in the top-right, where the subject talks on the phone and then runs.

Since this is a comparatively harder dataset, it was harder to perform well on it. The best available performance on this dataset was around 86%, by the creators of the dataset [38].

Evaluation

Our best performance is similar to the Bag of Visual Words baseline mentioned by the authors: with s = 95%, C = 10 we were able to get an accuracy of around 67.83%.

Figure 9.2. Confusion matrix for the Composable Activities dataset. We are able to do well on the first 8 activities, which are composed of multiple activities.

As previously observed, increasing s led to a decrease in performance.

Figure 9.3. A drastic decrease in average accuracy, to about half, with s = 99%.

Similarly, a significant decrease in s also led to a similarly reduced recognition accuracy.

Figure 9.4. A decrease in accuracy is seen when s is reduced to 75%.

As before, we are not performing close to the best; however, we were able to match at least one baseline, which goes to show that the method, although not perfect, can be made viable. As a reference, the confusion matrix obtained by [38] is shown here. We are actually able to outperform

them in recognizing some of the activities, like Composable Activities 5, 6 and 7 (numbered the same in Figure 9.2).

Figure 9.5. Confusion matrix obtained by [38], with around 85% average accuracy.

HDM05 Dataset

Description

The next dataset that we experiment on is the HDM05 dataset [55]. It is also a MoCap dataset, which provides 3D locations for 31 joints. However, like [2], we used just 4 joints corresponding to arms and legs. The results again followed the pattern seen in the previous experiments.

Evaluation

The best recognition accuracy that we were able to achieve was 90.5%, with s = 94%, C = 10. This is the only dataset on which we were able to outperform the state of the

art. However, as mentioned earlier, we follow a leave-one-out protocol, while [2] follows a leave-one-subject-out protocol; keeping that in mind, our performance is still not state of the art.

Figure 10.1. An accuracy of over 90% is achieved with s = 94% and C = 10.

Following the pattern so far, a higher s leads to a degradation in performance. At s = 99% the confusion matrix looks like this:

Figure 10.2. The average accuracy drops to around 77% at s = 99%.

Similarly, going down also leads to a drop in performance.

Figure 10.3. A drop in performance is observed when s = 85% is used.

Considering that we were able to achieve a higher performance than the one reported in [2], we re-did the experiments following their protocol, with leave-one-subject-out; the performance dropped well below the state of the art, to around 71%.

Figure 10.4. Confusion matrix for the best possible performance following the protocol of [2].
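The two protocols compared above differ only in what is held out in each fold. The sketch below shows the split logic under simple assumptions: the record layout (`subject`, `action`, `take`) and the function names are hypothetical, and no model is trained.

```python
def leave_one_out_folds(records):
    """Each fold holds out exactly one sequence."""
    for i in range(len(records)):
        yield records[:i] + records[i + 1:], [records[i]]

def leave_one_subject_out_folds(records):
    """Each fold holds out every sequence of one subject."""
    for subject in sorted({r["subject"] for r in records}):
        train = [r for r in records if r["subject"] != subject]
        test = [r for r in records if r["subject"] == subject]
        yield train, test

def subject_overlap(train, test):
    """True if a held-out subject also appears in the training set."""
    return bool({r["subject"] for r in train} & {r["subject"] for r in test})

# Toy metadata: 3 subjects x 2 actions x 2 takes = 12 sequences (hypothetical).
records = [{"subject": s, "action": a, "take": t}
           for s in ("s1", "s2", "s3") for a in ("walk", "clap") for t in (1, 2)]

loo = list(leave_one_out_folds(records))
loso = list(leave_one_subject_out_folds(records))
print(len(loo), any(subject_overlap(tr, te) for tr, te in loo))
print(len(loso), any(subject_overlap(tr, te) for tr, te in loso))
```

Under leave-one-out, other sequences from the held-out subject remain in the training set, whereas leave-one-subject-out removes them all. This may be one reason the second protocol is harder, although the sketch itself does not establish that.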

5.6. UTKinect Dataset

5.6.1. Description

The UTKinect Dataset [39] is another popular dataset that is frequently used for evaluation in action recognition settings. It is also based on 3D skeleton joints and consists of 10 simple actions:

1. Walk
2. Sit Down
3. Stand Up
4. Pick Up
5. Carry
6. Throw
7. Push
8. Pull
9. Wave Hands
10. Clap Hands

Each action is performed twice by each of 10 subjects, leading to 199 sequences (with one missing sequence).

Figure. Some sample images from different actions from the UTKinect Dataset.

5.6.2. Evaluation

The leave-one-out protocol is itself proposed by [39] in this case, and we follow it. Table 4 is taken directly from [2] and shows the performance of different methods on the dataset. While this is, once again, now a solved dataset, we are able to perform significantly better than [1] and close to [4].

Table 4. A comparison with different methods on the UTKinect dataset. We are able to do reasonably well on most activities except carry and throw.

Method       Walk   S.Dwn   S.Up   P.Up   Carry   Throw   Push   Pull   Wave   Clap   Avg
[1]
[4]
[39]
[2]
WFA (.95)

The following is the confusion matrix for s = 95%. We are able to perform reasonably well on all activities except carry and throw.

Figure. Confusion matrix for s = 95%, showing our best performance.

Bumping s up to 99% expectedly results in a drop in average accuracy.

Figure. Confusion matrix for s = 99%, showing a drop in average accuracy.

Similarly, selecting a lower s also leads to a drastic drop in accuracy.

Figure. Confusion matrix for s = 75%, showing a drastic drop.
