Probabilistic Siamese Network for Learning Representations. Chen Liu


Probabilistic Siamese Network for Learning Representations

by

Chen Liu

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto

© Copyright 2013 by Chen Liu

Abstract

Probabilistic Siamese Network for Learning Representations
Chen Liu
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2013

We explore the training of deep neural networks to produce vector representations using weakly labelled information in the form of binary similarity labels for pairs of training images. Previous methods such as siamese networks, IMAX and others have used fixed cost functions such as the $L_1$ and $L_2$ norms and mutual information to drive the representations of similar images together and the representations of different images apart. In this work, we formulate learning as maximizing the likelihood of binary similarity labels for pairs of input images, under a parameterized probabilistic similarity model. We describe and evaluate several forms of the similarity model that account for false positives and false negatives differently. We extract representations of MNIST, AT&T ORL and COIL-100 images and use them to obtain classification results. We compare these results with state-of-the-art techniques such as deep neural networks and convolutional neural networks. We also study our method from a dimensionality reduction perspective.

Acknowledgements

I am thankful to Professor Brendan Frey, whose guidance and support throughout the year made the completion of this project possible. I would also like to thank the members of the PSI lab, especially Hui, Andrew and Alice, for their valuable help and advice.

Contents

1 Introduction
  1.1 Motivation
  1.2 Research Goals
  1.3 Thesis Organization

2 Background
  2.1 Neural Network
  2.2 Siamese Network
  2.3 Optimization of the Neural Networks
    2.3.1 Gradient Descent and Mini-batch Stochastic Gradient Descent
    2.3.2 Momentum

3 Probabilistic Siamese Networks
  3.1 Similarity Model
  3.2 Learning the Model
  3.3 Algorithms

4 Experiments and Analysis
  4.1 Dataset Preprocessing
  4.2 Classification
  4.3 Comparison of similarity models using the ORL (AT&T) face data
  4.4 Results on MNIST handwritten digit classification
  4.5 Results on COIL-100 objects
  4.6 Dimensionality Reduction and Embedding

5 Conclusions and Future Work

A Effect of momentum on different datasets

List of Tables

2.1 Activation Functions
2.2 Comparisons of convergence speed using momentum on the MNIST dataset
3.1 Different forms of probabilistic similarity models
4.1 Classification performance due to different similarity models on the AT&T face data
4.2 Classification error rates on the MNIST dataset with various numbers of hidden units and different preprocessed features
4.3 Classification error rates on the MNIST dataset with and without preprocessing using different numbers of hidden layers
4.4 Classification error rates on the MNIST dataset. The first set of methods use similar/dissimilar labels to train the neural network. The second set of methods use the class labels during training
4.5 Classification results on the COIL-100 dataset where images from the same class were labelled as similar. The second set of methods were trained directly using the class labels
4.6 Classification results on COIL-100 images with different numbers of training images and with consecutive frames labelled as similar

List of Figures

1.1 An illustration of different semantic similarities
2.1 Basic Neural Network
2.2 The Siamese Network with a pair of identical subnetworks, which generates hidden representations for hand-crafted metrics
2.3 Gradient Descent Update
3.1 The proposed method uses a similarity model, which generates a probability of $P(s = 1)$
3.2 $\log P(s = 1 \mid h_1, h_2)$ versus $\|h_1 - h_2\|$ for different similarity models
3.3 The derivative of $\log P(s = 1 \mid h_1, h_2)$ w.r.t. $\|h_1 - h_2\|$, back-propagated through the neural network during training
4.1 Examples (one of each subject) from the AT&T ORL face dataset. Images were cropped to the same size with the eyes aligned. The illuminations were normalized with uniform backgrounds
4.2 Randomly selected 36 images from the 60,000-image training set of the MNIST dataset. All of the digits are centered with a 2-pixel boundary to the border, on a uniform background
4.3 Classification accuracy with different numbers of training examples
4.4 Example images (one from each category) from the COIL-100 dataset. All of the categories are centered with uniform background
4.5 2-Dimensional embedding produced by PCA on the MNIST dataset
4.6 2-Dimensional embedding produced by LDA on the MNIST dataset
4.7 2-Dimensional embedding produced by a neural network model using 1-of-N labels on the MNIST dataset. The embedding is produced using a network, and the embedding at the hidden layer with 2 units is shown
4.8 2-Dimensional embedding produced by the proposed model on the MNIST dataset. The embedding is produced using a network, and the embedding at the output layer with 2 output units is shown
4.9 Image examples from the NORB (small) dataset
4.10 An example of chain-like similarity used in the experiment with NORB data
4.11 Embeddings produced by DrLIM and the proposed model with full similarity (colours indicate varying azimuth)
4.12 Embeddings produced by DrLIM and the proposed model with chain-like similarity. Colours indicate varying azimuth

4.13 Embeddings produced by DrLIM and the proposed model with chain-like similarity. Colours indicate varying elevation

A.1 Gradient Descent Update with and without momentum on MNIST
A.2 Gradient Descent Update with and without momentum on AT&T faces
A.3 Gradient Descent Update with and without momentum on COIL
A.4 Gradient Descent Update with and without momentum on MNIST. ReLU hidden unit activation function

Chapter 1

Introduction

Pattern recognition has been a long-standing problem in machine learning research. The basic pattern recognition task is to find assignments of given inputs which are useful in an application. In the context of images, a pattern recognition task can be used to find the label of an object in a given image. Various methods have been studied for classifying images. These classification methods can be categorized into supervised (trained with fully labelled data), semi-supervised (trained with weakly or partially labelled data), or unsupervised (trained with unlabelled data) methods. Many of these methods capture fundamental differences between patterns for classification by using hand-crafted features, which is costly and time consuming. It is important to find methods that can automatically learn quality representations from large amounts of data, so that we can achieve better classification and recognition performance. Such an improvement would be beneficial in many applications, such as medicine, security and information technology.

Under the framework of supervised learning, training a model with single image inputs is very difficult. It requires a large and fully labelled dataset to achieve good performance. In contrast, an unsupervised learning method can use unlabelled data. However, good performance with unsupervised learning relies on sophisticated architectures, and the user has little control over the learned representations. Learning models using weak similarity labels for small collections of examples, such as pairs of input images, provides a means to discriminatively learn representations using deep architectures [1] without requiring completely labelled information [2, 3, 4, 5, 6, 7]. Furthermore, weakly labelled data can guide the algorithm to learn representations for different similarity definitions, information that cannot be easily obtained from unsupervised learning algorithms. Figure 1.1 provides an example of the difference in similarity definitions (the images were synthesized using [8] and [9]). If we are interested in distinguishing the object in the images, then the images in a) are similar because they both contain a tree. The images in b) are also similar because they both contain an airplane. The images in a) are considered to be dissimilar to the images in b) because they do not contain the same object. On the other hand, if we are interested in distinguishing the background of an image (scene classification), then all of the images in a) and b) are considered to be similar because they all have the same background. The underlying similarity definition allows us to tailor the representations to the precise aspect of the images that we are interested in, without the need for complex or dedicated algorithms.

In general, there are two ways to utilize similarity information. The first approach involves methods that use similarity measurements.

Figure 1.1: An illustration of different semantic similarities.

These methods use simple local adaptations of neighbourhood functions such as the Euclidean distance or the Mahalanobis distance (c.f. [10, 11, 12]). The second approach involves learning a non-linear function, such as a deep neural network, that maps the input to a vector representation suitable for discriminating pairs of similar or dissimilar inputs. For example, in IMAX [2] a model is trained in order to maximize the mutual information between the pairs of representations obtained from similar input examples. Another example of this second approach is the siamese architecture [3], in which pairs of inputs are processed by two copies of the same neural network. The pair of networks is trained to make their output vectors similar for input pairs that are labelled as similar, and dissimilar for input pairs that are labelled as dissimilar. Under this setting, discriminative tasks can be performed using weakly labelled data. A more detailed description of siamese networks is given in Chapter 2.

1.1 Motivation

The discriminative task in the siamese network framework can be achieved by learning well-separated hidden representations. Well-separated means that, in terms of the Euclidean distance, the hidden representations of similar input pairs should be close together and the hidden representations of dissimilar pairs should be far apart. To the best of our knowledge, prior works using the siamese network have made use of hand-chosen, fixed cost functions to achieve the desired separation in the hidden representations. Examples of such cost functions include mutual information [2], $L_1$ and $L_2$ norms [3, 7], and combinations using thresholds that clip off the tails of the metric distributions [6, 13]. An important issue in the above approach is that the most common cost functions are not normalized, which can bias the learned hidden representations in undesirable ways. Consequently, even in the extreme

case where the similarity labels are independent of the input pairs, certain hidden representations will still be preferred. For example, in [6] the neural network is trained to minimize the $L_1$ distance between hidden representations for similar pairs, and to maximize a clipped $L_1$ distance for dissimilar pairs. These two label-dependent costs are $C_1 = d$ and $C_0 = \max(0, \delta - d)$, where $d$ is the $L_1$ norm of the difference in hidden representations and $\delta$ is a positive hand-chosen parameter. For $d = \delta$, these are $C_1 = \delta$ and $C_0 = 0$, while for $d = 2\delta$, they are $C_1 = 2\delta$ and $C_0 = 0$. Thus, if the labels are independent of the input, representations with $d = \delta$ will be preferred over representations with $d = 2\delta$. Furthermore, without normalization it is hard to decide which of two pairs of test images is more likely to be similar. Scores such as $C_1$, $C_1 - C_0$, $C_1/(C_1 + C_0)$ and $1/(1 + e^{C_1 - C_0})$ are all possible candidates, and it is not clear which one is the most appropriate to use for assessing similarity.

1.2 Research Goals

To the best of our knowledge, no prior work exists that addresses the issues arising from hand-crafted cost functions and the lack of normalization in siamese networks. In this thesis, we formulate a probabilistic similarity model in order to eliminate hand-crafted cost functions and provide normalization. The proposed similarity model takes the pairs of hidden vector representations generated by the neural networks as input and produces a normalized probability distribution over the similarity label. The proposed probabilistic model is based on maximizing the likelihood of the similarity labels for pairs of input images. We explore several probability models and experimentally evaluate their performance. We consider models that correspond to fixed cost functions, including the Euclidean distance [3] and the clipped $L_1$ norm, as well as a number of novel models. The proposed method is compared to existing algorithms on several standard datasets and their performances are evaluated and discussed. Finally, the proposed algorithm is studied from a dimensionality reduction perspective.

1.3 Thesis Organization

In Chapter 2, we review the topics of neural networks, siamese networks, neural network training, and the effects of different activation functions and momentum on neural network training. In Chapter 3, a detailed discussion of the proposed method is presented: the theoretical motivation of our work is further explained, and the optimization algorithm for the proposed method is described in detail. Chapter 4 overviews the experimental setup and provides performance results for our method on several standard datasets, including AT&T face data, MNIST and COIL-100. The results are benchmarked against well-known algorithms such as Convolutional Neural Networks (CNN), Principal Component Analysis (PCA), and K-Nearest Neighbours (K-NN). Chapter 5 presents conclusions and discusses possible future directions.

Chapter 2

Background

In this chapter we introduce the fundamentals of neural networks and their extension to siamese networks. The optimization techniques used for neural network training are described, focusing on the gradient descent method. Lastly, we discuss methods for accelerating the optimization procedure, such as mini-batch stochastic gradient descent and momentum.

2.1 Neural Network

A neural network is a biologically inspired mathematical model consisting of artificial nodes, called neurons or units, and interconnections that connect these neurons together to mimic a biological neural network. A neural network can be used to approximate nearly any function. It is an adaptive system, in the sense that the interconnections can be changed by using a learning procedure. There are many variants of neural networks, based on different combinations of three essential elements:

- Neurons
- Weighted interconnections
- Activation functions

This is illustrated by the simple one hidden-layer neural network shown in Figure 2.1. Let the input to the neural network be the length-$i$ vector $X = \{x_1, x_2, \ldots, x_i\}$ and the output of the network be the length-$k$ vector $Y = \{y_1, y_2, \ldots, y_k\}$. Let $W^{(l)}$ represent the matrix of interconnection weights used to linearly combine the elements in the input vector to generate inputs to the $l$-th layer. For example, entry $w_{j,i}$ in the matrix $W^{(l)}$ is the weight between the $i$-th element of the input vector and the $j$-th unit in the $l$-th layer. Let $b^{(l)}$ represent the bias terms of the $l$-th layer. For a two-layer neural network (with one hidden layer), the output layer is denoted as $l = 2$. The hidden activity in the hidden layer $l - 1 = 1$ can be written as

    $a^{(l-1)} = (\text{hidden activity})^{(l-1)} = f(W^{(l-1)} X + b^{(l-1)})$    (2.1)

where $f(\cdot)$ is an activation function.

Figure 2.1: Basic Neural Network (input layer, hidden layer and output layer).

This activation can be propagated to the $l$-th layer to produce the activation for layer $l$, i.e. $a^{(l)}$. In our example, the activation for layer $l$ is also the output $Y$, given by

    $Y = a^{(l)} = f(W^{(l)} f(W^{(l-1)} X + b^{(l-1)}) + b^{(l)}) = f(W^{(l)} a^{(l-1)} + b^{(l)})$.    (2.2)

In most neural network applications, an error (or cost) function is defined in terms of the inputs, outputs, and network parameters. The neural network is trained by adapting its interconnection weights to minimize the error function. A solution for this optimization problem is to compute the gradient of the error function with respect to the network parameters, layer by layer, from the output layer to the input layer. This algorithm is commonly referred to as back-propagation in the neural network literature. The method of gradient descent is often used to perform the network parameter updates and to find a solution to this optimization problem. We will provide details on the back-propagation algorithm in Section 2.3.

In practice, many difficult choices must be made in order to use neural networks effectively. These choices include selecting the best network architecture and the best activation function. In general, when the application is designed for a large amount of training data, more hidden units can provide better performance. When only a small amount of training data is available, one can still choose to use a large number of hidden units, but regularization might be needed to prevent overfitting. Different activation functions generally achieve similar accuracies, unless specific requirements exist on the activation for the given task. However, different activation functions can lead to very different training speeds and computation times. Table 2.1 lists several commonly used activation functions, where $z$ denotes an $N$-dimensional input to the activation functions. The descriptions of training speed shown in Table 2.1 are based on common practice. The computation times (in brackets) correspond to calculating one forward pass and one derivative of a random matrix on a 2.20 GHz single-core CPU.

Table 2.1: Activation Functions

    Name                             Function                                                   Training    Computation
    Sigmoid                          $\sigma(z) = 1 / (1 + \exp(-z))$                           Slow        Fast (0.0140)
    Softmax                          $S_n = \exp(z_n) / \sum_{n'=1}^{N} \exp(z_{n'})$           Slow        Slow (0.2143)
    Tanh                             $\tanh(z) = (\exp(z) - \exp(-z)) / (\exp(z) + \exp(-z))$   Medium      Medium (0.0246)
    Rectified Linear Units (ReLU)    $\mathrm{ReLU}(z) = \max(0, z)$                            Fast        Fast (0.0036)
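To make equations (2.1) and (2.2) concrete, the following is a minimal NumPy sketch of the forward pass through a one hidden-layer network with the sigmoid activation from Table 2.1. The layer sizes and random initialization are illustrative assumptions, not values taken from the thesis.

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation from Table 2.1.
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    # Equation (2.1): hidden activity a^(1) = f(W^(1) X + b^(1))
    a1 = sigmoid(W1 @ X + b1)
    # Equation (2.2): output Y = f(W^(2) a^(1) + b^(2))
    Y = sigmoid(W2 @ a1 + b2)
    return a1, Y

# Illustrative sizes: 784-dimensional input, 250 hidden units, 10 outputs.
rng = np.random.default_rng(0)
X = rng.standard_normal(784)
W1, b1 = 0.01 * rng.standard_normal((250, 784)), np.zeros(250)
W2, b2 = 0.01 * rng.standard_normal((10, 250)), np.zeros(10)
hidden, output = forward(X, W1, b1, W2, b2)
```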

2.2 Siamese Network

A siamese network (or siamese architecture) is a particular neural network architecture consisting of two identical sub-networks with shared parameters, often used in a semi-supervised setting. An example of a siamese network is shown in Figure 2.2.

The siamese architecture [3] was first proposed as a pair of identical neural networks (with shared parameters) that utilized similarity label information between inputs (as shown in Figure 2.2) for signature verification. The network could be trained to make the output vectors similar for input pairs that were labelled as similar and not similar for input pairs that were labelled as dissimilar. More recently, the siamese network has been modified and applied to other tasks such as text classification [4] and speech feature classification [7]. In [5], the network was used for face verification. The authors achieved good performance by adapting the convolutional neural network architecture with an energy-based model and a contrastive loss. Mobahi et al. [6] proposed a similar siamese architecture for learning from video data. Pairs of input images (frames) were labelled as similar if they were close in time, and dissimilar if they were distant in time. A two-term cost function was used. The first term was a discriminative supervised function. The second term was designed to use semi-supervised similarity label information. In Chapter 1, we briefly described semi-supervised cost functions that use similarity information, and discussed several weaknesses in this type of cost function. The semi-supervised cost function in [6] is given by

    $C_{s=1} = \|h_1 - h_2\|_1$    (2.3)
    $C_{s=0} = \max(0, \delta - \|h_1 - h_2\|_1)$    (2.4)

where $s$ is a binary-valued similarity label, $h_1$ and $h_2$ are the hidden representations of a pair of inputs, and $\delta$ is a positive, hand-chosen parameter.

2.3 Optimization of the Neural Networks

2.3.1 Gradient Descent and Mini-batch Stochastic Gradient Descent

The cost functions arising from neural network optimization (such as squared error, maximum likelihood, etc.) are optimized by finding a minimum. In practice, since the dimension of the network weights can be very large and there are many local minima, a good local minimum can achieve fairly good performance. The gradient descent (GD) algorithm is an iterative, first-order optimization technique that can be applied directly to neural network training. At each iteration, gradient descent computes the gradient of the cost function $E(\cdot)$ with respect to the network parameters, and updates them according to

    $W_{t+1} = W_t - \eta \nabla E(W_t)$    (2.5)

where $t$ denotes the iteration number and $\nabla E(W_t)$ is calculated over all training examples. For a sufficiently small step size $\eta$, the algorithm is guaranteed to find a minimum on the error surface. An illustration of the gradient descent method is shown in Figure 2.3.

The gradient descent method has several drawbacks when the amount of training data is large. First, it is expensive to compute the overall gradient at each iteration. Second, in finding the best network parameters, the optimization process tends to neglect portions of the training data, i.e. the network tends to find the best average parameters. The stochastic gradient descent (SGD) algorithm was developed to improve on gradient descent in the presence of a large amount of training data. In SGD, the true gradient of the error function is approximated by calculating the gradient based on a single training example $X_m$.

Figure 2.2: The Siamese Network with a pair of identical subnetworks (shared parameters), which generates hidden representations for hand-crafted metrics.

Figure 2.3: Gradient Descent Update

The parameters are then updated according to

    $W_{t+1} = W_t - \eta \nabla E(W_t, X_m)$,    (2.6)

as summarized in Algorithm 1.

Algorithm 1 Stochastic Gradient Descent
Input: Randomly initialized W
 1: for every epoch do
 2:   Randomly shuffle the training examples
 3:   for m = 1, ..., M do
 4:     Compute $\nabla E(W_t, X_m)$
 5:     $W_{t+1} = W_t - \eta \nabla E(W_t, X_m)$
 6:   end for
 7: end for

In addition, instead of training on a single example at each iteration, the SGD algorithm can be modified to obtain its gradient by using mini-batches of training examples. Typically, a mini-batch consists of a small number of training examples [14]. In mini-batch SGD, Algorithm 1 is placed within a loop over the mini-batches, with the gradient calculated from the training examples within each mini-batch. The benefits of mini-batch SGD are two-fold. First, it is cheaper to compute an approximate gradient for a small subset of training examples than to run GD on the entire dataset. Second, mini-batch SGD is less noisy than SGD, as it accounts for more examples per update. Furthermore, in every epoch the training examples are reshuffled, so different training examples are grouped into different subsets compared to the previous epoch. The network parameters are optimized in different directions at each update, and multiple gradient steps are taken within one epoch, so the network is more likely to learn from the examples that are ignored in regular GD. In the end, the algorithm converges faster to a better local optimum than regular gradient descent [14].
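The following is a minimal sketch of mini-batch SGD as just described: Algorithm 1 with the single-example update (2.6) replaced by a gradient computed over a small batch, and the examples reshuffled every epoch. The gradient function grad_fn, the step size and the batch size are user-supplied assumptions, not values from the thesis.

```python
import numpy as np

def minibatch_sgd(W, X, grad_fn, eta=0.1, batch_size=100, epochs=10, seed=0):
    """Mini-batch SGD: grad_fn(W, batch) should return the gradient of the
    cost E with respect to W, averaged over the rows of `batch`."""
    rng = np.random.default_rng(seed)
    M = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(M)               # reshuffle every epoch
        for start in range(0, M, batch_size):
            batch = X[order[start:start + batch_size]]
            W = W - eta * grad_fn(W, batch)      # update (2.6) on the batch
    return W
```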

2.3.2 Momentum

Momentum methods [15, 16] for accelerating stochastic gradient convergence have been widely studied [17, 18, 15]. In gradient descent with momentum, the current update depends not only on the current gradient but also on previous updates. When there is a long and narrow valley in the error surface, gradient descent will oscillate back and forth along the minor axis. The momentum term forms a weighted combination with the previous velocity and drives the update towards the minimum at a faster rate. Other commonly used acceleration methods, such as conjugate gradient and L-BFGS, require a large amount of memory for computations. Therefore, momentum is a good low-cost solution for speeding up the learning process in neural networks.

Define $\Delta W_t = W_t - W_{t-1}$; then regular momentum is given by

    $\Delta W_t = -\eta \nabla E(W_t) + \gamma \Delta W_{t-1}$.    (2.7)

A variation on regular momentum called Nesterov momentum was discussed in [19, 15]. If we let

    $\xi_t = W_{t-1} - \eta \nabla E(W_{t-1})$,    (2.8)

then Nesterov momentum modifies regular momentum to

    $W_t = (1 + \gamma)\xi_t - \gamma \xi_{t-1}$    (2.9)

where $\xi_0$ is initialized to zero.

We used a toy problem and conducted experiments to compare the effect of different types of momentum; the results are shown in Table 2.2. The experiments were constructed using the squared-error function (given 1-of-N coding as the target) on MNIST data using a one hidden-layer neural network (sigmoid activation functions, 250 hidden units). Twenty thousand images were sampled from the MNIST dataset for training and 10,000 images were used for validation. The numbers in Table 2.2 show the iteration at which the validation accuracy reached 90%. For all three experiments, the same training and validation datasets were used, and the initialization, network architecture and cost functions were the same. By using regular momentum with gradient descent, a 3x speed-up in convergence was achieved with sigmoid hidden units. By further modifying the momentum to Nesterov momentum, a 2x speed-up was achieved over regular momentum on the MNIST dataset. However, the gain of using Nesterov momentum is data dependent. Please refer to Appendix A for details on experiments using Nesterov momentum on other datasets.

Table 2.2: Comparisons of convergence speed using momentum on the MNIST dataset

    Name                                        Iterations to achieve 90% accuracy
    Gradient Descent                            > 800 iterations
    Gradient Descent + Regular Momentum         286 iterations
    Gradient Descent + Nesterov Momentum        144 iterations

Finally, momentum scheduling [20] is yet another simple method to further improve momentum gains. An example is the trapezoid schedule, which consists of the following steps:

1. Set an initial momentum value $0 < \gamma_0 < 1$
2. Gradually increase the momentum to a large value less than 1 (such as 0.99)
3. After a few hundred epochs, gradually decrease the momentum down to $\gamma_1$

Note that it is not necessary for the momentum term to increase linearly. For example, the momentum can also be increased exponentially, or as a quadratic function over iterations.
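As a reference for the updates in (2.7)-(2.9), here is a minimal sketch of classical and Nesterov momentum applied to a generic gradient function. The function grad_fn, the learning rate and the fixed momentum value are illustrative assumptions; a schedule such as the trapezoid schedule above could be added by varying gamma over iterations.

```python
import numpy as np

def gd_momentum(W, grad_fn, eta=0.1, gamma=0.9, steps=100):
    # Classical momentum, equation (2.7):
    #   dW_t = -eta * grad(E(W_t)) + gamma * dW_{t-1}
    dW = np.zeros_like(W)
    for _ in range(steps):
        dW = -eta * grad_fn(W) + gamma * dW
        W = W + dW
    return W

def gd_nesterov(W, grad_fn, eta=0.1, gamma=0.9, steps=100):
    # Nesterov momentum, equations (2.8)-(2.9):
    #   xi_t = W_{t-1} - eta * grad(E(W_{t-1}))
    #   W_t  = (1 + gamma) * xi_t - gamma * xi_{t-1}, with xi_0 = 0
    xi_prev = np.zeros_like(W)
    for _ in range(steps):
        xi = W - eta * grad_fn(W)
        W = (1.0 + gamma) * xi - gamma * xi_prev
        xi_prev = xi
    return W
```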

Chapter 3

Probabilistic Siamese Networks

In this chapter, we describe the proposed algorithm based on siamese networks. We present a number of different probabilistic similarity models and study their strengths and weaknesses. The learning procedure is discussed in detail and the optimization updates are derived to allow reproduction of our results.

3.1 Similarity Model

We assume that a neural network with parameters $\theta$ is used to map an input image $X$ to an $N$-dimensional hidden vector representation $h$, where $h = f(X, \theta)$. Figure 3.1 illustrates the proposed architecture. During training, image pairs $X_1$ and $X_2$ are provided and the network is used to generate corresponding hidden vectors $h_1$ and $h_2$. For each input pair, a binary label $s$ indicating whether the pair is similar ($s = 1$) or not ($s = 0$) is also provided. For instance, given the example in Figure 1.1 in Chapter 1 for object category classification, the pair of images in a) would have the similarity label $s = 1$ and the pair of images in b) would also have the similarity label $s = 1$. A pair of images with one image from a) and one image from b) would have the similarity label $s = 0$. It is our goal to train the neural networks so that the hidden vector representations for similar input pairs are close, and the hidden vector representations for dissimilar input pairs are clearly distinguishable.

The probabilistic similarity models take as input the two hidden representations $h_1$ and $h_2$ and output a distribution over the similarity label $s$, $P(s \mid h_1, h_2)$. Using probabilistic models allows for the normalization of the two possible predictions. We formulate the learning problem as one of maximizing the conditional likelihood of the similarity labels under a similarity model. The similarity models are parameterized by a positive vector $\alpha$ so that $P(s = 1 \mid h_1, h_2) = g(h_1, h_2, \alpha)$ and $P(s = 0 \mid h_1, h_2) = 1 - g(h_1, h_2, \alpha)$. Apart from the trivial constraint that the range of $g$ be in $[0, 1]$, there are several additional important properties for the similarity models:

- It should be symmetric in $h_1$ and $h_2$, i.e. $g(h_1, h_2, \alpha) = g(h_2, h_1, \alpha)$, since we assume that the input pairs are not given in any particular order.
- The maximum similarity should only occur when $h_1 = h_2$. In particular, we require that $g(h_1, h_2, \alpha) \le g(h_1, h_1, \alpha)$ and $g(h_1, h_2, \alpha) \le g(h_2, h_2, \alpha)$.
- The similarity model should not be too flexible, or it may take over the task that the preceding neural network is meant to achieve.

Figure 3.1: The proposed method uses a similarity model, which generates a probability of $P(s = 1)$ from the hidden representations produced by the pair of subnetworks with shared parameters.

We consider models that depend on weighted norms of the form $d = \sum_{n=1}^{N} \alpha_n |h_{1,n} - h_{2,n}|^l$, where the $\alpha_n$ are positive coefficients that are optimized during learning. Table 3.1 lists the similarity models that we considered in this thesis. For ease of reference, we named the models by the shape of the probability $P(s = 1 \mid h_1, h_2)$, i.e. $g(h_1, h_2, \alpha)$, as a function of $d = \sum_{n=1}^{N} \alpha_n |h_{1,n} - h_{2,n}|^l$. For example, the Gaussian model does not, in fact, make use of a Gaussian distribution, but the form of the dependence of $P$ on $d$ is that of a Gaussian density function. In the first two models, named Gaussian and flattened Gaussian, $\log P$ decreases quadratically for large $d$; the latter is flatter than the Gaussian model for small $d$. In the Laplacian model $\log P$ decreases linearly with $d$. The probabilistic clipped $L_1$ is a normalized version of the clipped $L_1$ cost function described previously (inspired by [6]). We will compare the representations obtained using the normalized and unnormalized versions of the $L_1$ cost function later in the thesis. Lastly, in the Cauchy model $\log P$ decreases more slowly with $d$ than in all of the other models. Figure 3.2 shows plots of $\log P$ with respect to $d$ for these models, with roughly aligned curvature when $d$ is small. Figure 3.3 shows plots of their derivatives, which are back-propagated through the neural network during training.

Figure 3.2: $\log P(s = 1 \mid h_1, h_2)$ versus $\|h_1 - h_2\|$ for different similarity models (Gaussian, flattened Gaussian, Laplacian, probabilistic clipped $L_1$ (Mobahi et al.), and Cauchy).

Figure 3.3: The derivative of $\log P(s = 1 \mid h_1, h_2)$ w.r.t. $\|h_1 - h_2\|$, back-propagated through the neural network during training (Gaussian, flattened Gaussian, Laplacian, probabilistic clipped $L_1$ with $\delta = 3$ and $\delta = 6$, and Cauchy).

Table 3.1: Different forms of probabilistic similarity models.

    Model                          $P(s = 1 \mid h_1, h_2)$
    Gaussian                       $\exp\left(-\sum_{n=1}^{N} \alpha_n (h_{1,n} - h_{2,n})^2\right)$
    Flattened Gaussian             $(\beta + 1) / \left(\beta + \exp\left(\sum_{n=1}^{N} \alpha_n (h_{1,n} - h_{2,n})^2\right)\right)$
    Laplacian                      $\exp\left(-\sum_{n=1}^{N} \alpha_n |h_{1,n} - h_{2,n}|\right)$
    Probabilistic clipped $L_1$    $e^{-d} / (e^{-d} + e^{-\max(0, \delta - d)})$, with $d = \sum_{n=1}^{N} \alpha_n |h_{1,n} - h_{2,n}|$
    Cauchy                         $1 / \left(1 + \sum_{n=1}^{N} \alpha_n (h_{1,n} - h_{2,n})^2\right)$

From Figure 3.3, we can already see that models such as the Gaussian, flattened Gaussian or probabilistic clipped $L_1$ likely will not perform well. If a pair of similar images is far apart in the hidden-representation space, the derivative of the Gaussian and flattened Gaussian models will be very large, which can bias the model to fit to this particular pair. This issue is addressed by the probabilistic clipped $L_1$ model, but not as well as in models such as the Laplacian and Cauchy. The derivative of the Laplacian model depends only on the sign of the difference: for all non-zero values of $d$, the model penalizes all data pairs equally. Thus, even though the Laplacian model can better handle outliers, it most likely will not fit the good examples very well. Finally, the Cauchy model penalizes true outliers less severely than all of the other proposed models, while still applying different amounts of penalization to typical examples (unlike the Laplacian). Therefore, we expect the Cauchy model to provide the best performance in our experiments, which are reported in Chapter 4.

3.2 Learning the Model

The training set consists of $M$ pairs of input examples $(X_1^m, X_2^m)$, $m = 1, \ldots, M$, and corresponding similarity labels $s^m$, $m = 1, \ldots, M$. In training, we maximize the conditional probability of the observed labels, both $s = 1$ and $s = 0$, given the hidden representations. The conditional likelihood is given by

    $P(s^1, s^2, \ldots, s^M \mid h_1^1, h_2^1, \ldots; \alpha) = \prod_{m=1}^{M} g(h_1^m, h_2^m, \alpha)^{s^m} \left(1 - g(h_1^m, h_2^m, \alpha)\right)^{1 - s^m}$    (3.1)

where $h_1^m = f(X_1^m, \theta)$ and $h_2^m = f(X_2^m, \theta)$. Taking the natural logarithm, we have

    $L(s^1, s^2, \ldots, s^M \mid h_1^1, h_2^1, \ldots; \alpha) = \sum_{m=1}^{M} \left( s^m \log g(h_1^m, h_2^m, \alpha) + (1 - s^m) \log\left(1 - g(h_1^m, h_2^m, \alpha)\right) \right)$.    (3.2)
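The following is a minimal sketch of three of the similarity models in Table 3.1 and the log-likelihood (3.2), where alpha is the vector of positive scaling coefficients and delta is the hand-chosen clipping parameter. The clipping of g away from 0 and 1 is a numerical safeguard added only for the sketch, not part of the model.

```python
import numpy as np

def p_gaussian(h1, h2, alpha):
    # Gaussian model from Table 3.1.
    return np.exp(-np.sum(alpha * (h1 - h2) ** 2))

def p_cauchy(h1, h2, alpha):
    # Cauchy model from Table 3.1.
    return 1.0 / (1.0 + np.sum(alpha * (h1 - h2) ** 2))

def p_clipped_l1(h1, h2, alpha, delta):
    # Probabilistic clipped L1 model from Table 3.1.
    d = np.sum(alpha * np.abs(h1 - h2))
    return np.exp(-d) / (np.exp(-d) + np.exp(-max(0.0, delta - d)))

def log_likelihood(pairs, labels, alpha, p_fn=p_cauchy):
    # Equation (3.2): sum over pairs of s*log g + (1 - s)*log(1 - g).
    total = 0.0
    for (h1, h2), s in zip(pairs, labels):
        g = np.clip(p_fn(h1, h2, alpha), 1e-12, 1.0 - 1e-12)
        total += s * np.log(g) + (1 - s) * np.log(1.0 - g)
    return total
```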

For the 2-layer neural network shown in Figure 3.1, we calculate the hidden representations for $X_1^m$ and $X_2^m$ by computing the forward pass values:

    $a_1^{(1)m} = f(z_1^{(1)m}) = f(W^{(1)} X_1^m + b^{(1)})$    (3.3)
    $a_2^{(1)m} = f(z_2^{(1)m}) = f(W^{(1)} X_2^m + b^{(1)})$    (3.4)
    $h_1^m = a_1^{(2)m} = f(z_1^{(2)m}) = f(W^{(2)} a_1^{(1)m} + b^{(2)})$    (3.5)
    $h_2^m = a_2^{(2)m} = f(z_2^{(2)m}) = f(W^{(2)} a_2^{(1)m} + b^{(2)})$    (3.6)

where $a_1^{(1)m}$ is the activation of layer 1 for the first image in the $m$-th pair of input images, $a_2^{(1)m}$ is the activation of layer 1 for the second image in the $m$-th pair, $b^{(1)}$ represents the bias of the input layer, and $b^{(2)}$ represents the bias of the output layer. Note that here a superscript with brackets is the index of the hidden layer, a superscript without brackets is the training case index, and a subscript indicates the image within a pair of images. This notation applies to all the equations listed in Section 3.2, except equation (3.1).

After calculating $h_1^m$ and $h_2^m$, the log-likelihood can be computed. Then the similarity model parameters $\alpha$ and the network parameters are jointly optimized. The gradient of the log-likelihood with respect to the similarity model parameters is given by

    $\frac{\partial L}{\partial \alpha} = \sum_{m=1}^{M} \frac{s^m - g^m}{g^m (1 - g^m)} \frac{\partial g(h_1^m, h_2^m, \alpha)}{\partial \alpha}$,    (3.7)

where $g^m = g(h_1^m, h_2^m, \alpha)$ is the similarity model output for the $m$-th training case. The neural network is trained by back-propagating the derivatives through both $h_1^m$ and $h_2^m$. For example, to find the gradient with respect to the output layer weights $W^{(2)}$, we first take the derivatives with respect to $h_1$ and $h_2$:

    $\frac{\partial L}{\partial h_1^m} = \sum_{m=1}^{M} \frac{s^m - g^m}{g^m (1 - g^m)} \frac{\partial g(h_1^m, h_2^m, \alpha)}{\partial h_1^m}$,    (3.8)

    $\frac{\partial L}{\partial h_2^m} = \sum_{m=1}^{M} \frac{s^m - g^m}{g^m (1 - g^m)} \frac{\partial g(h_1^m, h_2^m, \alpha)}{\partial h_2^m}$.    (3.9)

Then, according to the chain rule, the derivative with respect to $W^{(2)}$ is given by

    $\frac{\partial L}{\partial W^{(2)}} = \frac{\partial L}{\partial h_1^m} \frac{\partial h_1^m}{\partial W^{(2)}} + \frac{\partial L}{\partial h_2^m} \frac{\partial h_2^m}{\partial W^{(2)}}$,    (3.10)

with the last back-propagation step given by

    $\frac{\partial h_1^m}{\partial W^{(2)}} = \frac{\partial h_1^m}{\partial z_1^{(2)m}} \frac{\partial z_1^{(2)m}}{\partial W^{(2)}}$,    (3.11)

    $\frac{\partial h_2^m}{\partial W^{(2)}} = \frac{\partial h_2^m}{\partial z_2^{(2)m}} \frac{\partial z_2^{(2)m}}{\partial W^{(2)}}$.    (3.12)

The back-propagation expressions for the input weights $W^{(1)}$ can be obtained in a similar manner. They are

    $\frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial z_1^{(1)m}} \frac{\partial z_1^{(1)m}}{\partial W^{(1)}} + \frac{\partial L}{\partial z_2^{(1)m}} \frac{\partial z_2^{(1)m}}{\partial W^{(1)}}$,    (3.13)

    $\frac{\partial L}{\partial z_1^{(1)m}} = \frac{\partial L}{\partial h_1^m} \frac{\partial h_1^m}{\partial z_1^{(1)m}}$,    (3.14)

    $\frac{\partial L}{\partial z_2^{(1)m}} = \frac{\partial L}{\partial h_2^m} \frac{\partial h_2^m}{\partial z_2^{(1)m}}$.    (3.15)

Back-propagation can be easily generalized to $l$ layers, where $l \in \{1, 2, \ldots\}$. The complete set of back-propagation expressions for the siamese network is given as follows.

For the output layer $\tau$:

    $\delta_1^{(\tau)} = \frac{\partial L}{\partial h_1^m} f'(z_1^{(\tau)m})$    (3.16)
    $\delta_2^{(\tau)} = \frac{\partial L}{\partial h_2^m} f'(z_2^{(\tau)m})$    (3.17)

For layers $l = 1, 2, \ldots, \tau - 1$:

    $\delta_1^{(l)} = (W^{(l)} \delta_1^{(l+1)}) f'(z_1^{(l)m})$    (3.18)
    $\delta_2^{(l)} = (W^{(l)} \delta_2^{(l+1)}) f'(z_2^{(l)m})$    (3.19)

For the parameter updates:

    $W^{(l)} = W^{(l)} - \eta (a_1^{(l)} \delta_1^{(l+1)} + a_2^{(l)} \delta_2^{(l+1)})$    (3.20)
    $b^{(l)} = b^{(l)} - \eta (\delta_1^{(l+1)} + \delta_2^{(l+1)})$    (3.21)

We can modify the above update equations using (2.8) and (2.9) to incorporate momentum and speed up the updates.
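To illustrate how the derivatives are accumulated over the two weight-sharing copies of the network, the following is a minimal sketch for a single training pair with one hidden layer, sigmoid activations and the Cauchy similarity model. It returns the derivatives of the log-likelihood, which would be ascended (or negated for the descent-style updates (3.20)-(3.21)); the clipping of g is a numerical safeguard added only for the sketch, and the whole function is an illustration rather than the thesis implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def siamese_pair_gradients(x1, x2, s, W1, b1, W2, b2, alpha):
    # Forward pass through both twins with shared parameters, eqs (3.3)-(3.6).
    a1_1 = sigmoid(W1 @ x1 + b1); h1 = sigmoid(W2 @ a1_1 + b2)
    a1_2 = sigmoid(W1 @ x2 + b1); h2 = sigmoid(W2 @ a1_2 + b2)

    # Cauchy similarity model and derivative of the log-likelihood w.r.t. g.
    diff = h1 - h2
    g = np.clip(1.0 / (1.0 + np.sum(alpha * diff ** 2)), 1e-12, 1.0 - 1e-12)
    dL_dg = s / g - (1 - s) / (1.0 - g)

    # Derivatives of g w.r.t. h1, h2 and alpha (used in eqs (3.7)-(3.9)).
    dg_dh1 = -(g ** 2) * 2.0 * alpha * diff
    dL_dh1 = dL_dg * dg_dh1
    dL_dh2 = -dL_dh1
    dL_dalpha = dL_dg * (-(g ** 2)) * diff ** 2

    # Output-layer deltas (3.16)-(3.17); sigmoid'(z) = h (1 - h).
    d2_1 = dL_dh1 * h1 * (1.0 - h1)
    d2_2 = dL_dh2 * h2 * (1.0 - h2)
    # Hidden-layer deltas (3.18)-(3.19).
    d1_1 = (W2.T @ d2_1) * a1_1 * (1.0 - a1_1)
    d1_2 = (W2.T @ d2_2) * a1_2 * (1.0 - a1_2)

    # Accumulate the contributions of both twins, as in (3.20)-(3.21).
    grad_W2 = np.outer(d2_1, a1_1) + np.outer(d2_2, a1_2)
    grad_b2 = d2_1 + d2_2
    grad_W1 = np.outer(d1_1, x1) + np.outer(d1_2, x2)
    grad_b1 = d1_1 + d1_2
    return grad_W1, grad_b1, grad_W2, grad_b2, dL_dalpha
```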

3.3 Algorithms

The algorithm first creates training pairs on the fly and divides them into mini-batches with an equal number of similar and dissimilar pairs. Then, for each batch, we perform a forward propagation to obtain the cost. Finally, we calculate the errors for back-propagation and update the network parameters (along with the momentum according to its schedule). The proposed method is formally summarized in Algorithm 2 using the notation specified in the previous section; the network parameters and the conditional probability parameters are updated jointly.

Algorithm 2 SGD learning of the neural network and similarity model
Input: Training pairs $(X_1^m, X_2^m)$ and labels $s^m$, $m = 1, \ldots, M$, the architecture of the neural network and the similarity model, randomly initialized parameter values, learning rate and momentum schedule.
Output: Optimized network parameters, scaling parameter $\alpha$
 1: while termination condition not satisfied do
 2:   Create training pairs on-the-fly
 3:   for batch number = 1, ..., B do
 4:     Sample a mini-batch of K training examples
 5:     for k = 1, ..., K do
 6:       Forward propagation:
 7:         Compute $h_1^k$, $h_2^k$ and $g^k$
 8:       Backward propagation:
 9:         Compute and accumulate the derivatives w.r.t. the similarity model parameters $\alpha$, i.e. $\partial L / \partial \alpha$ (Eqn. 3.7)
10:         Compute and accumulate the derivatives w.r.t. the neural network parameters $\theta$ using back-propagation, i.e. $\partial L / \partial W^{(2)}$ (Eqn. 3.10), $\partial L / \partial W^{(1)}$ (Eqn. 3.13), ...
11:     end for
12:     Update the parameter velocities using momentum and the accumulated derivatives
13:     Update the parameters using their velocities
14:   end for
15: end while
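Step 2 of Algorithm 2 creates training pairs on-the-fly with an equal number of similar and dissimilar pairs. A minimal sketch of such a sampler is given below; it derives the weak similarity labels from class labels, assumes every class has at least two examples, and is only an illustration rather than the sampler used in the experiments.

```python
import numpy as np

def sample_balanced_pairs(X, y, n_pairs, rng=None):
    """Return n_pairs training pairs, alternating similar (same class)
    and dissimilar (different class) pairs, plus their labels s."""
    rng = rng or np.random.default_rng()
    classes = np.unique(y)
    pairs, labels = [], []
    for k in range(n_pairs):
        if k % 2 == 0:                                     # similar pair
            c = rng.choice(classes)
            i, j = rng.choice(np.flatnonzero(y == c), size=2, replace=False)
            s = 1
        else:                                              # dissimilar pair
            c1, c2 = rng.choice(classes, size=2, replace=False)
            i = rng.choice(np.flatnonzero(y == c1))
            j = rng.choice(np.flatnonzero(y == c2))
            s = 0
        pairs.append((X[i], X[j]))
        labels.append(s)
    return pairs, np.array(labels)
```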

Chapter 4

Experiments and Analysis

In this chapter, we assess the performance of our proposed similarity models with siamese networks. We conducted experiments using the AT&T (ORL) face data, MNIST, and COIL datasets. We examined different versions of the proposed model and directly compared them with existing models such as [6]. We also evaluated the proposed model in the context of dimensionality reduction.

4.1 Dataset Preprocessing

Prior to applying machine-learning algorithms to images for training and testing purposes, data preprocessing can be performed on the images. The reasons for data preprocessing include noise reduction, dimensionality reduction, and feature extraction. Common preprocessing techniques include image whitening, Principal Component Analysis (PCA), Histogram of Oriented Gradients (HOG) [21], and Scale-Invariant Feature Transform (SIFT) [22] feature extraction. In this work, we used the PCA method to preprocess the images in a number of experiments. The PCA method was chosen because of its fast runtime, ease of implementation, and the availability of reported results for benchmarking.

PCA is an unsupervised preprocessing algorithm that uses an orthogonal transformation to convert a set of data vectors into linearly uncorrelated components referred to as the principal components. Mathematically, given a set of zero-mean input vectors $X \in \mathbb{R}^{N \times M}$ and an appropriate $K \times N$-dimensional matrix of orthogonal basis vectors $B$ with $K \le N$, we have

    $\mathrm{PCA}(X) = BX$.    (4.1)

For dimensionality reduction, $K$ is chosen to be a much smaller value than $N$. The key to PCA is to find the appropriate set of $K$ orthogonal basis vectors that minimize the loss due to the linear transformation. Formally, we perform the optimization

    $B = \arg\min_B E\{\|x - B^\top \mathrm{PCA}(x)\|^2\}$    (4.2)

where $x$ are the columns of $X$. To find the basis vectors for PCA we take the following steps:

1. Center the input data

2. Compute the covariance matrix of the input data
3. Find the eigenvalues and eigenvectors of the covariance matrix
4. Select the eigenvectors that correspond to the K largest eigenvalues as the principal components

4.2 Classification

Although the proposed model produces a single probability value, the output-layer hidden-unit activities can be used for classification. Some of the classification tasks are performed using an $L_2$-Support Vector Machine ($L_2$ SVM) [23, 24]. In an $L_2$ SVM, given a set of inputs $\{x_1, x_2, \ldots, x_I\}$ and corresponding class labels $\{y_1, y_2, \ldots, y_I\}$, we perform classification by solving

    $\min_w \left\{ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{I} \max(0, 1 - y_i w x_i)^2 \right\}$    (4.3)

where the parameter $C$ is chosen by validation in each experiment.

For the experiments with MNIST, a softmax classifier is also used. A softmax classifier can be considered as a one-layer neural network (no hidden layer) with a softmax activation function (Table 2.1). Given input $X$ and class label $Y$, the cost function is defined as

    $E = -\frac{1}{M} \sum_{m=1}^{M} \sum_{j=1}^{\text{Class}} \mathbf{1}\{Y_m = j\} \log(f_j(W X_m))$    (4.4)

where $m$ indexes the images, there are a total of $M$ images, $W$ is the weight matrix, $\mathbf{1}\{Y_m = j\}$ is an indicator function, and $f(\cdot)$ is the softmax activation function.

In this work, we have chosen simple classification methods for the following reasons:

1. A specialized classifier for improving performance is not required
2. Data with more categories are relatively cheaper and easier to process with an SVM
3. To demonstrate the ability to deepen the network for classification tasks using a softmax classifier
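Before moving on to the experiments, here is a minimal NumPy sketch of the four PCA steps listed in Section 4.1. It assumes the data matrix stores one example per column, matching the notation in (4.1), and is not the preprocessing code used for the experiments (in particular, the quadratic feature expansion used later is omitted).

```python
import numpy as np

def pca_fit_transform(X, K):
    """Return the K-dimensional PCA projection B X, the basis B and the mean."""
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean                               # 1. center the input data
    C = (Xc @ Xc.T) / Xc.shape[1]               # 2. covariance matrix (N x N)
    eigvals, eigvecs = np.linalg.eigh(C)        # 3. eigenvalues and eigenvectors
    top = np.argsort(eigvals)[::-1][:K]         # 4. K largest eigenvalues
    B = eigvecs[:, top].T                       #    K x N orthogonal basis
    return B @ Xc, B, mean                      # PCA(X) = B X, as in (4.1)
```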

4.3 Comparison of similarity models using the ORL (AT&T) face data

We used the ORL (AT&T) face dataset to evaluate our proposed similarity models. The ORL face data [25] consists of 400 gray-scale images of 40 subjects taken from 10 different views, with the eyes aligned. The original images were down-sampled to 28 × 28 pixels. We created a base training set by randomly selecting 5 views of each subject (200 images in total) and set aside the remaining data for testing. Figure 4.1 shows several example images, one from each subject in the dataset.

Figure 4.1: Examples (one of each subject) from the AT&T ORL face dataset. Images were cropped to the same size with the eyes aligned. The illuminations were normalized with uniform backgrounds.

We used the mini-batch stochastic gradient descent algorithm for training. Each mini-batch was generated on-the-fly by randomly selecting 400 training pairs from the 200 base training images, so that an equal number of similar and dissimilar pairs are chosen.

The trapezoidal momentum schedule was used with Nesterov momentum. All of the results were obtained using a layered architecture. In the sequel, a layered architecture means a neural network with an input size of 784 pixels, 1000 hidden units, 20 output units, and a binary similarity label at the last layer. For each model, training was repeated 5 times with a different random initialization of the network weights and different training and test splits each time, in order to estimate the confidence intervals. For the clipped $L_1$ and the probabilistic clipped $L_1$ similarity models, we performed a search over the $\delta$ parameter using held-out validation data. After training the neural network, the hidden representations from the 20 output units were fed into an $L_2$ SVM for classification. The best average results are reported.

Table 4.1 lists the classification accuracy under the different similarity models and a state-of-the-art result obtained using a convolutional neural network [6]. The bracketed values are standard deviations of the accuracy values. In this work, we do not focus on biometric identification, thus the overall classification accuracy is reported inclusively in the table.

Table 4.1: Classification performance due to different similarity models on the AT&T face data.

    Model                                   Accuracy (Std.)
    Gaussian                                87.02% (0.71)
    Flattened Gaussian                      92.40% (1.80)
    Laplacian                               86.70% (2.80)
    Clipped $L_1$ [6]                       91.80% (1.88)
    Probabilistic clipped $L_1$             (3.20)
    Cauchy                                  94.30% (1.70)
    CNN trained using class labels [6]      94.05%

From Table 4.1, it is shown that the Cauchy conditional probability model outperforms the other similarity models. This is expected since the Gaussian and Laplacian models do not have the appropriate tail shape, and the Cauchy model is more robust to outliers due to its tail shape. As we argued in Chapter 3, the clipped $L_1$ and flattened Gaussian models do not have the appropriate first-order derivatives; this is also confirmed by the experimental results. Furthermore, the Cauchy model shows a performance similar to the convolutional neural network, which is trained using the class labels themselves rather than using weak similar/dissimilar labels. This indicates that with a suitable probabilistic model it is not necessary to use a complex architecture to learn better representations. Based on the results in Table 4.1, we decided to use the Cauchy similarity model for the remaining experiments in this chapter.

4.4 Results on MNIST handwritten digit classification

The second experiment was performed using the MNIST handwritten digits dataset. The MNIST handwritten digit dataset was collected by Dr. Yann LeCun [26], and has become a popular benchmarking dataset for classification and neural network based algorithms. The MNIST dataset consists of 70,000 grey-scale images of the digits 0 to 9 with a variety of writing styles. Figure 4.2 shows example images from the dataset.

Figure 4.2: Randomly selected 36 images from the 60,000-image training set of the MNIST dataset. All of the digits are centered with a 2-pixel boundary to the border, on a uniform background.

Compared to the dataset used in the previous experiment, MNIST contains a smaller number of classes but has more variation within each class. It also contains a large number of training images. With the MNIST dataset, an important question to study is whether our model can perform better with more data, and if so, how much data is enough. MNIST is a relatively simple dataset: a large number of algorithms can perform well with a small amount of training data, while no gains are obtained by using more training examples. In our study, we conducted experiments using different numbers of training examples for the proposed algorithm, with a fixed network architecture, on the original MNIST dataset. The MNIST training data consisted of 60,000 images and the test dataset consisted of 10,000 images. We randomly sampled 5,000, 10,000, 20,000, and 50,000 training examples from the training set, and tested on the complete test set. The resulting classification error using the hidden representations with an $L_2$ SVM is plotted against the number of training examples in Figure 4.3.

Interestingly, the algorithm performed worse than a simple classifier when there was a small number of training examples. For example, a one hidden-layer neural network classifier with 60 hidden units and no gradient acceleration could achieve around a 7% error rate on MNIST. However, the algorithm showed a nearly linear performance gain with increasing training data. This is a desirable property, since our model can use all of the available training data to maximize its performance. On the other hand, the performance of our model may not be optimal when only a small amount of training data is available.

Other factors that can affect the performance of the proposed model are the number of hidden units and data preprocessing. In general, it is known that more hidden units and preprocessing are beneficial to representation learning. We trained the proposed algorithm with different numbers of hidden units on the original MNIST dataset, and on the same dataset with preprocessing using 30 PCA components plus quadratics and 40 PCA components plus quadratics. By quadratics we mean the product of every two projections on the principal components. It is shown in Table 4.2 that the proposed algorithm works better

Figure 4.3: Classification accuracy with different numbers of training examples (y-axis: error in %; legend: siamese network with Cauchy similarity model).

with more hidden units and 40 PCA components plus quadratics. Note that in Table 4.2, Input is 784 for images without preprocessing, and 465 and 840 for 30 PCA + Quadratics and 40 PCA + Quadratics, respectively.

Table 4.2: Classification error rates on the MNIST dataset with various numbers of hidden units and different preprocessed features.

    Preprocessing           Input     Input     Input
    None                    4.27%     2.80%     2.60%
    30 PCA + Quadratics     2.80%     2.70%     2.34%
    40 PCA + Quadratics     2.68%     2.10%     1.89%

Based on the results obtained from the above experiments, in the remaining MNIST-based experiments we randomly selected 50,000 training images as our base training set and used the remaining 10,000 training images for validation. For training, we randomly created 50,000 similar pairs that had the same digit, plus 50,000 dissimilar pairs that had different digits. The network was trained using mini-batch stochastic gradient descent with pairs taken from the 100,000 base training pairs. Nesterov momentum was used with a trapezoid momentum schedule. We report the results on the complete set of 10,000 testing images.

The proposed algorithm is applicable to both shallow (fewer than 2 hidden layers) and deep (more than 2 hidden layers) network architectures. Thus, it is important to find out how many layers are enough for good classification results. The classification results of the proposed model with different hidden-layer depths are listed in Table 4.3. We considered two types of feature vectors: the original image raster-scanned into a 784-dimensional vector, and features derived using 40 PCA components together with quadratic combinations, forming a feature vector of length 840 [26]. In Table 4.3, the no hidden layer, one hidden layer and two hidden layer networks all take 784 inputs and were trained without pre-training.

Table 4.3: Classification error rates on the MNIST dataset with and without preprocessing using different numbers of hidden layers.

    Number of Hidden Layers    Preprocessing           Error Rate (Std.)
    No hidden layer            None                    > 10%
    No hidden layer            40 PCA + Quadratics     > 5%
    One hidden layer           None                    2.60% (0.21)
    One hidden layer           40 PCA + Quadratics     1.93% (0.14)
    Two hidden layers          None                    > 5%
    Two hidden layers          40 PCA + Quadratics     > 5%

Table 4.3 shows that, depending on the data and algorithm, a deep architecture is not necessarily better than a shallow architecture. In fact, for the proposed model on the MNIST dataset, a single

hidden-layer architecture seems to be the most efficient.

Finally, we compare the proposed model using a single hidden-layer network with several other reference techniques. The references include various shallow neural networks and the state of the art for deep neural networks [27]. Once again, we consider two types of feature vectors: the original image raster-scanned into a 784-dimensional vector, and features derived using 40 PCA components, along with quadratic combinations, of length 840 [26]. The classification results were obtained using a single hidden-layer architecture with sigmoid activation functions. No pre-training was used and the network weights were randomly initialized. After training the neural network using the similar/dissimilar labels, a softmax classifier applied to the penultimate layer of units was trained for digit classification. Table 4.4 reports the classification results on the MNIST dataset and shows that the best results were achieved using PCA preprocessing and broad hidden layers. We ran the experiments 5 times with different random training/testing splits and different parameter initializations to obtain confidence intervals. The best error rate of our method is 1.93%, which is competitive with deep architectures that were trained using class labels and without stochastic methods for regularization (dropout, denoising auto-encoders, etc.).

Table 4.4: Classification error rates on the MNIST dataset. The first set of methods use similar/dissimilar labels to train the neural network. The second set of methods use the class labels during training.

    Method                                                Preprocessing           Error Rate (Std.)
    Clipped $L_1$ cost function, softmax classifier [6]   None                    3.57% (0.03)
    Proposed method, softmax classifier                   None                    2.60% (0.21)
    Proposed method, softmax classifier                   40 PCA + Quadratics     1.93% (0.14)
    PCA + Quadratics [26]                                 40 PCA + Quadratics     3.30%
    Deep neural network [27]                              None                    1.60%

4.5 Results on COIL-100 objects

We used the COIL-100 dataset for the third experiment. The dataset [28] consists of 7200 colour images of 100 household objects with 72 different views per object. Figure 4.4 shows some example images from the dataset. The dataset was obtained by taking pictures of the objects at every 5 degrees of rotation on a turn-table. Illumination normalization was applied to all images. In our experiments, we converted the images to gray-scale and resized them to a smaller resolution.

We focused on two different experiments. In the first experiment, following [6], we created a training set of 400 images by taking the views with angles 0, 90, 180 and 270 degrees for each of the 100 objects. The remaining 6800 images were used for testing. Similarity was determined by object class. In the second experiment, we only labelled consecutive frames as similar. To test the effect of training set size, we formulated three sub-experiments where we randomly split the data by taking 8, 16, or 32 training images from each object category. During training, mini-batches of 100 images were generated by selecting an equal number of similar and dissimilar pairs.

Table 4.5 benchmarks the performance of the proposed method against methods trained directly using the class labels, including a linear SVM [29], a k-nearest neighbour classifier, and a convolutional neural network [6].

Figure 4.4: Example images (one from each category) from the COIL-100 dataset. All of the categories are centered with a uniform background.

Our method achieved a best classification rate of 70.67%, which is comparable to the state of the art under the training and testing setup that we used. Although in [29] a classification rate of 78.5% was achieved using an $L_2$ SVM applied to the original training images, under our more stringent test setup the $L_2$ SVM achieved only 67.33%. In comparison, our method shows an improvement of approximately 3%. Compared to the convolutional neural network used in [6], we used a simpler non-convolutional architecture and achieved comparable results. Furthermore, our neural network was trained using only weak similar/dissimilar labels rather than full class labels. Our method also achieved a classification rate that is slightly better than the original clipped $L_1$ cost, but likely without statistical significance.

We questioned whether the first training condition, which was used in [6], was the most appropriate for this task. Thus, we experimented using the second training condition, where only consecutive frames were labelled as similar. As before, we also investigated the effect of different numbers of training cases while keeping the architecture unchanged. Table 4.6 shows the classification rates. Note that in the case

Table 4.5: Classification results on the COIL-100 dataset where images from the same class were labelled as similar. The second set of methods were trained directly using the class labels.

    Method                               Accuracy (Std.)
    Clipped $L_1$ cost, $L_2$ SVM [6]    69.28% (0.53)
    Proposed method, $L_2$ SVM           70.67% (0.33)
    K-NN [6]                             70.10%
    Linear SVM [29]                      78.50%
    $L_2$ SVM                            67.33%
    Standard CNN [6]                     71.49%

Table 4.6: Classification results on COIL-100 images with different numbers of training images and with consecutive frames labelled as similar.

    Method                             Number of training images    Accuracy (Std.)
    Clipped $L_1$ [6] ($L_2$ SVM)      800                          (1.35)
    Proposed method ($L_2$ SVM)        800                          (0.68)
    Clipped $L_1$ [6] ($L_2$ SVM)      1600                         (1.10)
    Proposed method ($L_2$ SVM)        1600                         (1.95)
    Clipped $L_1$ [6] ($L_2$ SVM)      3200                         (0.70)
    Proposed method ($L_2$ SVM)        3200                         (1.20)

where only 800 training images were available (8 images per category), the results were averaged over 10 runs. The proposed model shows a consistently better average performance than the method described in [6]. In [6], a semi-supervised setting was introduced for temporal coherence learning in video-like images. Given the consistent results shown in Table 4.6, a potential future direction of our work is an extension to video temporal coherence.

4.6 Dimensionality Reduction and Embedding

At the beginning of this chapter, we explained that one of the reasons for choosing the simple SVM and softmax classifiers is to avoid relying on complex classifiers for good classification performance. To further demonstrate that the competitive results in our experiments were due to the learned hidden representations, we performed simple comparisons of 2D embeddings of the proposed model's output activations (which were the inputs to the simple classifiers) with those obtained using PCA and Linear Discriminant Analysis (LDA).

Figure 4.8 shows a 2-dimensional embedding of the hidden representations obtained using our method with a network having 2 output units. The embeddings produced by PCA and LDA are shown in Figure 4.5 and Figure 4.6, respectively. In Figure 4.6, we see that the 2D embeddings cannot be separated linearly, even though they were produced using fully labelled images.

Figure 4.5: 2-Dimensional embedding produced by PCA on the MNIST dataset. (a) The colour coding of the MNIST digits used in Figure 4.5b, Figure 4.6, Figure 4.7, and Figure 4.8. (b) The embedding produced by PCA on the MNIST data.

PCA, as described in the previous section, does not require image labels; however, the embedding produced using two principal components is not as good as that of our model. From Figure 4.8, it is clear that the proposed algorithm can learn embeddings that effectively separate the data using only similar/dissimilar labels. This indicates that the experimental results presented in the previous section did not rely on a specialized classifier to achieve competitive performance.

To demonstrate the benefits of a probabilistic algorithm over hand-crafted costs, we repeated the dimensionality-reduction experiment of [13] using our method. In [13], the authors introduced Dimensionality Reduction by Learning an Invariant Mapping (DrLIM) with a clipped Euclidean-distance similarity model. The siamese architecture used was related to [6] but did not require a supervised term in its cost function. The DrLIM cost consists of a similar-pair term d^2 (the squared Euclidean distance) and a dissimilar-pair term max(0, δ − d)^2, where d is the Euclidean distance between the hidden representations of a pair of inputs and δ is a positive hand-chosen parameter.
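Written out for a batch of pairs, the DrLIM cost described above takes the following form. This is a NumPy sketch with the per-term weighting constants of [13] omitted; y is 1 for similar pairs and 0 for dissimilar pairs.

    # DrLIM pair cost: d^2 for similar pairs, max(0, delta - d)^2 for dissimilar
    # pairs, with d the Euclidean distance between the two hidden representations.
    import numpy as np

    def drlim_cost(z1, z2, y, delta):
        d = np.linalg.norm(z1 - z2, axis=1)                # Euclidean distance per pair
        similar_term = d ** 2                              # pulls similar pairs together
        dissimilar_term = np.maximum(0.0, delta - d) ** 2  # pushes dissimilar pairs apart, up to the margin delta
        return np.mean(y * similar_term + (1 - y) * dissimilar_term)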

Figure 4.6: 2-Dimensional embedding produced by LDA on the MNIST dataset.

DrLIM was studied as a dimensionality reduction algorithm. In particular, its experiments used the small NORB dataset [30] with one instance from the airplane class. The instance consists of 972 airplane images on uniform backgrounds, taken at 18 different azimuths, 9 different elevations, and under 6 different lighting conditions. Randomly chosen example images are shown in Figure 4.9. We followed the same approach as [13] and randomly split the 972 images into 660 images for training and 312 images for testing.

We performed two comparisons. First, we reproduced the experiment in the DrLIM paper [13]. We provided full similarity information, where a pair of images was defined as similar if they were taken from consecutive azimuth or elevation settings. This definition implies that images with the same azimuth and elevation, regardless of lighting, should be mapped to the same location, i.e., the embedding should be lighting invariant. Following DrLIM, the resulting 3D embeddings are presented in Figure 4.11. Second, we provided chain-like similarity information, of the kind that naturally occurs in video, in which azimuth is the fastest-changing variable and lighting is the slowest-changing variable. An example is shown in Figure 4.10. Along the chain, azimuth changes once per image, elevation changes once every 18 images, and lighting changes once every 9 elevations. We used an identical setup and identical training/testing splits for both the DrLIM algorithm and the proposed algorithm. The resulting 3D embeddings are shown in Figure 4.12. Note that all of the embeddings shown in Figure 4.11 and Figure 4.12 are from the 312 test images, and each colour represents an azimuth.
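As a concrete illustration of how the chain-like labels can be generated, the sketch below orders the 972 images with azimuth changing fastest, then elevation, then lighting, and labels consecutive positions on the chain as similar. The text does not specify how dissimilar pairs were drawn, so sampling random non-adjacent positions here is an assumption.

    # Chain-like similarity labels for the NORB airplane instance: consecutive
    # frames along the (lighting, elevation, azimuth) chain are similar (label 1);
    # dissimilar pairs (label 0) are sampled at random, which is an assumption.
    import itertools
    import random

    def chain_pairs(n_lightings=6, n_elevations=9, n_azimuths=18,
                    n_dissimilar=1000, seed=0):
        # With azimuth as the innermost factor it changes once per image,
        # elevation once every 18 images, and lighting once every 9 elevations.
        chain = list(itertools.product(range(n_lightings),
                                       range(n_elevations),
                                       range(n_azimuths)))
        similar = [(i, i + 1, 1) for i in range(len(chain) - 1)]
        rng = random.Random(seed)
        dissimilar = []
        while len(dissimilar) < n_dissimilar:
            i, j = rng.randrange(len(chain)), rng.randrange(len(chain))
            if abs(i - j) > 1:                 # not adjacent on the chain
                dissimilar.append((i, j, 0))
        return similar + dissimilar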

Figure 4.7: 2-Dimensional embedding produced by a neural network that uses 1-of-N labels on the MNIST dataset. The embedding is taken from a hidden layer with 2 units.

Figure 4.8: 2-Dimensional embedding produced by the proposed model on the MNIST dataset. The embedding is taken from an output layer with 2 units.

We used the same architecture as indicated in the DrLIM paper [13]. When full similarity information is provided, the proposed algorithm works as well as DrLIM in the low-dimensional space. The DrLIM algorithm, however, is not suitable for video-like data with chain-like similarity information. This can be observed in Figure 4.12, where each colour represents an azimuth. Although both algorithms can distinguish between lighting conditions and map images with different lightings into small clusters, as shown in Figure 4.12b and Figure 4.12d, the DrLIM embedding does not show any easily identifiable structure, whereas in the proposed algorithm the structure is easy to observe. The differences can also be seen in Figure 4.13, where the colours represent elevations. In the chain-like similarity setup, lighting is the slowest-changing variable and azimuth is the fastest-changing variable. Thus, if we plot the 3D embeddings and highlight the different elevations, we expect to see 9 ordered stripes of colour in one projection, organized into small clusters due to the different lightings, and a rainbow of colours in another projection. This is observed in Figure 4.13c and Figure 4.13d, but not clearly in the equivalent DrLIM experiments. This improvement is likely due to the normalized probability and the heavy-tailed, Cauchy-like distribution in our model.

Under the full similarity setting, images with a small distance in pixel space are considered similar (neighbours). This is a natural and simple constraint that can be imposed on many dimensionality reduction algorithms for finding embeddings. Under the chain-like similarity setting, however, similarity is not closely related to the distance between pairs of images in pixel space. For example, in Figure 4.10, the first image in the bottom row (the start of the example chain) and the first image in the second row from the top are close in pixel space, yet under the chain-like similarity setting they are considered dissimilar. The learning task is difficult in this case, because the algorithm must either extract the underlying representations well or be robust to outliers. Evidently, the DrLIM cost functions do neither well. Thus, from a dimensionality reduction perspective, our experiments show that the proposed probabilistic model performs better than hand-crafted models.
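To make the contrast with the clipped DrLIM cost concrete, the sketch below writes the pair cost as the negative log-likelihood of the binary similarity labels under a normalized, heavy-tailed similarity probability. The specific Cauchy-like form p(similar | d) = 1/(1 + d^2) is assumed here only for illustration; the model actually used in this work is the one defined in Chapter 3.

    # Illustrative probabilistic pair cost: negative log-likelihood of the binary
    # labels y under an assumed heavy-tailed similarity probability
    # p(similar | d) = 1 / (1 + d^2). This is a sketch, not the exact model of
    # Chapter 3.
    import numpy as np

    def similarity_nll(z1, z2, y, eps=1e-12):
        d2 = np.sum((z1 - z2) ** 2, axis=1)    # squared distance per pair
        p = 1.0 / (1.0 + d2)                   # normalized, heavy-tailed similarity probability
        p = np.clip(p, eps, 1.0 - eps)         # guard the logarithms
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1.0 - p))

Under this assumed form, a similar pair that remains far apart in the representation space incurs a penalty that grows only logarithmically in the distance rather than quadratically, which is one way to interpret the robustness to outliers discussed above.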

Figure 4.9: Image examples from the NORB (small) dataset (axis labels: Elevations, Lightings).

Figure 4.10: An example of the chain-like similarity used in the experiment with NORB data.
