Deep Belief Network for Clustering and Classification of a Continuous Data
Mostafa A. Salama (1), Aboul Ella Hassanien (2), Aly A. Fahmy (2)
(1) Department of Computer Science, British University in Egypt, Cairo, Egypt, Mostafa.salama@gmail.com
(2) Faculty of Computers and Information, Cairo University, Egypt, aboitcairo.aly.fahmy@gmail.com

Abstract. A Deep Belief Network (DBN) is a deep architecture that consists of a stack of Restricted Boltzmann Machines (RBMs). The deep architecture has the benefit that each layer learns more complex features than the layers before it. A DBN or an RBM can be used as a feature extraction method, or as a neural network whose connecting weights have already been learned. The proposed approach uses a DBN for clustering and classification of continuous input data without applying back-propagation in the DBN architecture. A DBN should perform better than a traditional neural network because its connecting weights are initialized by learning rather than set at random. Each layer in the DBN (an RBM) relies on the contrastive divergence method for input reconstruction, which increases the performance of the network.

1. Introduction

Kernel machines such as Support Vector Machines are local-kernel-based approaches, whereas non-local learning algorithms have the potential to generalize to patterns not covered by the training set. Kernel machines are also shallow architectures, with only two levels of data-dependent computational elements; the same is true of feed-forward neural networks with a single hidden layer [1]. Recently, deep architectures trained in an unsupervised manner have been proposed as an automatic method for extracting useful features. Deep architectures consist of feature-detector units arranged in layers: lower layers detect simple features and feed into higher layers, which in turn detect more complex features. Hinton et al. recently proposed a greedy layer-wise unsupervised learning procedure relying on the training algorithm of restricted Boltzmann machines (RBMs) to initialize the parameters of a deep belief network (DBN), a generative model with many layers of hidden causal variables. In a DBN the bottom layer is observable, and the multiple hidden layers are created by stacking multiple RBMs on top of each other. An RBM is a generative model that uses a layer of binary variables to explain its input data [2]. The top RBM consists of two layers with symmetric undirected connections; in certain cases this top RBM is a Harmonium RBM with continuous Gaussian hidden nodes. The training is unsupervised, but it produces useful features which can later be tuned by back-propagation for classification or dimensionality reduction. Three aspects of this strategy are particularly important: (1) pre-training one layer at a time in a greedy way; (2) using unsupervised learning at each layer in order to preserve information from the input; (3) fine-tuning the whole network with respect to the ultimate criterion of interest [1]. Recently, there has been a need to adapt the unsupervised learning algorithm to the nature of the inputs [3]. The proposed approach handles continuous-valued inputs by scaling them to the [0, 1] interval. Clustering and classification are then applied to the well-known Iris data (a continuous dataset) using the DBN architecture, without back-propagating an error signal from a last layer containing the class labels.
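As a small illustration of the scaling step mentioned above, the sketch below rescales each continuous feature to the [0, 1] interval with a min-max transform. It is only a sketch of the assumed preprocessing; the function name and the use of NumPy are our own choices, not part of the original paper.

import numpy as np

def scale_to_unit_interval(X):
    # Min-max scale each column (feature) of X into [0, 1].
    # X has shape (n_objects, n_features), e.g. the four Iris measurements.
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0.0] = 1.0   # avoid division by zero for constant features
    return (X - col_min) / col_range

Applied to the Iris measurements, every attribute then lies in the same [0, 1] range that the binary-unit RBM expects.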
The structure of this paper is as follows: Section 2 presents the background on the RBM and DBN architectures, Section 3 describes the proposed DBN approach for clustering and classification, and Section 4 reports the clustering results on continuous datasets and the classification results on the Iris dataset.

2. Deep Belief Network

2.1. Restricted Boltzmann Machine

An RBM is an energy-based undirected generative model that uses a layer of hidden variables to model a distribution over visible variables [4]. The undirected model of the interactions between the hidden and visible variables ensures that the contribution of the likelihood term to the posterior over the hidden variables is approximately factorial, which greatly facilitates inference [5]. Energy-based means that the probability distribution over the variables of interest is defined through an energy function. The model is composed of a set of visible variables V = {v_i} and a set of hidden variables H = {h_j}, where i indexes nodes in the visible layer and j nodes in the hidden layer. It is restricted in the sense that there are no visible-visible or hidden-hidden connections.
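To make the structure just described concrete, the following sketch (hypothetical names, NumPy assumed; not code from the paper) holds the parameters of one RBM: a single weight matrix connecting visible and hidden units plus one bias vector per layer, with no visible-visible or hidden-hidden weights.

import numpy as np

class RBM:
    # Parameters and conditional distributions of one restricted Boltzmann machine.
    def __init__(self, n_visible, n_hidden, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        # W[j, i] couples hidden unit j to visible unit i; there are no
        # connections inside a layer, so W is the only coupling matrix.
        self.W = 0.01 * rng.standard_normal((n_hidden, n_visible))
        self.hidden_bias = np.zeros(n_hidden)
        self.visible_bias = np.zeros(n_visible)

    def hidden_probs(self, v):
        # p(h_j = 1 | v) = sigmoid(hidden_bias[j] + sum_i W[j, i] * v[i])
        return 1.0 / (1.0 + np.exp(-(self.hidden_bias + v @ self.W.T)))

    def visible_probs(self, h):
        # p(v_i = 1 | h) = sigmoid(visible_bias[i] + sum_j W[j, i] * h[j])
        return 1.0 / (1.0 + np.exp(-(self.visible_bias + h @ self.W)))

The two methods anticipate the conditional distributions derived next; the later sketches reuse this class.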
The steps of the RBM learning algorithm can be stated as follows. Due to the conditional independence (no connections) between nodes in the same layer, the conditional distributions are:

P(h|v) = \prod_j p(h_j|v),   p(h_j = 1 | v) = f(c_j + \sum_i w_{ij} v_i),   p(h_j = 0 | v) = 1 - p(h_j = 1 | v)    (1)

and

P(v|h) = \prod_i p(v_i|h),   p(v_i = 1 | h) = f(b_i + \sum_j w_{ij} h_j),   p(v_i = 0 | h) = 1 - p(v_i = 1 | h)    (2)

for a binary data vector, where the function f is the sigmoid \sigma(z) = 1 / (1 + e^{-z}). The joint distribution (likelihood) over the hidden and visible units is defined through the energy function:

P(v,h) = e^{-E(v,h)} / \sum_{v',h'} e^{-E(v',h')},   E(v,h) = -h^T W v - b^T v - c^T h    (3)

where h^T denotes the transpose of h. The gradient of the average log-likelihood with respect to the parameters gives the update rules:

\Delta w_{ij} = \epsilon \, \partial \log p(v) / \partial w_{ij} = \epsilon ( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} )    (4)
\Delta b_i = \epsilon ( \langle v_i \rangle_{data} - \langle v_i \rangle_{model} )    (5)
\Delta c_j = \epsilon ( \langle h_j \rangle_{data} - \langle h_j \rangle_{model} )    (6)

where \epsilon is a small learning-rate parameter. The \langle \cdot \rangle_{model} term takes exponential time to compute exactly, so the Contrastive Divergence (CD) approximation to the gradient is used instead [6]. Contrastive divergence runs the Gibbs sampler for a single iteration, started at the data, instead of waiting until the chain converges. In this case the term \langle \cdot \rangle_1 denotes the expectation with respect to samples obtained from one full Gibbs step initialized at the data, and the update rules become:

\Delta w_{ij} = \epsilon ( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_1 )    (7)
\Delta b_i = \epsilon ( \langle v_i \rangle_{data} - \langle v_i \rangle_1 )    (8)
\Delta c_j = \epsilon ( \langle h_j \rangle_{data} - \langle h_j \rangle_1 )    (9)

The Harmonium RBM is an RBM with Gaussian continuous hidden nodes [6], where f is the normal density, taking the form shown in Equation (10):

p(h_j = h | x) = N(c_j + w_j \cdot x, 1)    (10)

The Harmonium RBM is used for a discrete output in the last layer of a deep belief network for classification.

2.2. Deep Belief Network Architecture

The key idea behind training a deep belief network by training a sequence of RBMs is that the model parameters \theta learned by an RBM define both p(v|h, \theta) and the prior distribution over hidden vectors p(h|\theta), so the probability of generating a visible vector v can be written as:

p(v) = \sum_h p(h|\theta) p(v|h, \theta)    (11)

After learning \theta, p(v|h, \theta) is kept while p(h|\theta) can be replaced by a better model that is learned by treating the hidden activity vectors H = {h} as the training data (visible layer) for another RBM. This replacement improves a variational lower bound on the probability of the training data under the composite model. The study in [12] supports the following three observations: (1) once the number of hidden units in the top level crosses a threshold, the performance essentially flattens at around a certain accuracy; (2) the performance tends to decrease as the number of layers increases; (3) the performance increases as each RBM is trained for an increasing number of iterations.

When class labels and back-propagation are not used in the DBN architecture (unsupervised training) [7], the DBN can serve as a feature extraction method for dimensionality reduction. On the other hand, when class labels are associated with the feature vectors, the DBN is used for classification. There are two general types of DBN classifier architectures: the Back-Propagation DBN (BP-DBN) and the Associative Memory DBN (AM-DBN) [8].
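The one-step contrastive divergence update of Equations (7)-(9) can be spelled out as in the sketch below. It is a minimal NumPy illustration for a single binary data vector, reusing the hypothetical RBM class from the previous sketch; it is not the authors' implementation.

import numpy as np

def cd1_update(rbm, v_data, epsilon=0.1, rng=None):
    # One contrastive-divergence (CD-1) step:
    #   delta W = epsilon * (<v h>_data - <v h>_1), plus the matching bias updates,
    # using a single Gibbs step started at the data instead of a converged chain.
    if rng is None:
        rng = np.random.default_rng(0)

    # Positive phase: hidden probabilities driven by the data.
    h_data = rbm.hidden_probs(v_data)

    # One full Gibbs step: sample hiddens, reconstruct visibles, re-infer hiddens.
    h_sample = (rng.random(h_data.shape) < h_data).astype(float)
    v_recon = rbm.visible_probs(h_sample)
    h_recon = rbm.hidden_probs(v_recon)

    # Updates from the difference between data and reconstruction statistics.
    rbm.W += epsilon * (np.outer(h_data, v_data) - np.outer(h_recon, v_recon))
    rbm.hidden_bias += epsilon * (h_data - h_recon)
    rbm.visible_bias += epsilon * (v_data - v_recon)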
For both architectures, when the number of possible classes is very large and the distribution of frequencies for the different classes is far from uniform, it may sometimes be advantageous to use a different encoding for the class targets than the standard one-of-k softmax encoding.

Back-Propagation DBN: adds a final layer of variables that represent the desired outputs (k outputs) and then performs a purely discriminative fine-tuning phase using back-propagation. Using back-propagation to fine-tune feature detectors that were initially learned as a generative model works much better than using back-propagation with random initial weights, as in a traditional neural network.

Associative Memory DBN: the top-level RBM is trained on data obtained by concatenating the high-level representation produced by unsupervised learning with a binary label vector that contains a 1 in the location representing the correct class. In other words, the top RBM models the joint distribution of the inputs and the associated target classes. When training the top layer of weights (the ones in the associative memory), the labels are provided as part of the input. The labels are represented by turning on one unit in a "softmax" group of k visible units. Softmax converts an arbitrary real-valued vector into a multinomial probability vector; it is a generalization of the sigmoid function to k outcomes.
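The "softmax group" of label units mentioned above is simply the softmax mapping from k real-valued scores to a multinomial probability vector, paired with a one-of-k target encoding. The short sketch below illustrates both; the function names are our own, not the paper's.

import numpy as np

def softmax(z):
    # Map a real-valued score vector z to a probability vector over k classes.
    z = z - z.max()          # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def one_hot(label, k):
    # One-of-k encoding: a 1 in the position of the correct class, 0 elsewhere.
    t = np.zeros(k)
    t[label] = 1.0
    return t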
3. Clustering and Classification using DBN

The aim of this study is to use an undirected DBN for the classification of continuous datasets such as the Iris and Abalone datasets. RBMs were originally developed using binary stochastic units for both the visible and hidden layers. What is known about continuous-valued data and neurons indicates that training is much slower than with binary inputs; given that training on binary inputs is itself a fairly slow process, training directly on continuous inputs would have been infeasible. Previous work on continuous-valued input to RBMs includes, for example, adding noise to sigmoid units [9]. In this work the input is scaled into the interval [0, 1] before it is clustered with the DBN. The DBN consists of three stacked RBMs: the first RBM takes the (scaled) input as its visible layer, and its hidden layer becomes the visible layer of the second RBM. The hidden layer of the final RBM, which consists of only one unit, is the output of the DBN. The first RBM is trained by running Gibbs sampling for 1000 iterations; its output is then passed to the second RBM, which is trained with another 1000 iterations of Gibbs sampling. The architecture of the DBN is shown in Figure 1.

The steps of the DBN classification approach can be summarized as follows: the first RBM receives the input at its visible nodes and models it on its hidden layer; the modelled input on the hidden layer is then passed to the visible nodes of the second RBM; this modelling and passing continues up to the last layer, which is composed of one unit. In this approach feature selection is an optional step that depends on the data itself. The procedure is given below (DBN classification algorithm); a code sketch of the same layer-wise pass follows Figure 1.

DBN Classifier
    Initialize epsilon = 0.1                 // epsilon: learning rate
    Initialize gn = 1000                     // gn: number of Gibbs sampling iterations
    Read and scale the input to the range [0, 1] into a two-dimensional array v[NI][NF]   // NI, NF: number of input objects and of features
    Select the most discriminative features (optional)
    Initialize n = 3                         // number of RBMs
    Initialize the number of hidden units of each RBM
    Initialize W randomly                    // W: weights of the DBN network (arrays W1, W2, W3)
    Define W'                                // trained weights produced by the DBN (arrays W'1, W'2, W'3)
    Call DBN_train(n, v, gn)
    Cluster the output; assign a class label to each cluster according to the input
    Run the objects of the testing dataset through the trained DBN; the output of the DBN determines each object's class according to the cluster range

DBN_train(n, v, gn)
    // Train on the input; the result of training is a one-dimensional array of length NI
    W1, b1 = RBM_Alg(v, epsilon, W'1, b, c, NF/2, gn)
    for all hidden units i: v1[k][i] = P(v1[k][i] = 1 | v[k]) = sigm(b1[i] + sum_j(W1[i][j] * v[k][j]))
    W2, b2 = RBM_Alg(v1, epsilon, W'2, b, c, NF/4, gn)
    for all hidden units i: v2[k][i] = P(v2[k][i] = 1 | v1[k]) = sigm(b2[i] + sum_j(W2[i][j] * v1[k][j]))
    W3, b3 = RBM_Alg(v2, epsilon, W'3, b, c, 1, gn)
    for the single hidden unit: v3[k][0] = P(v3[k][0] = 1 | v2[k]) = sigm(b3[0] + sum_j(W3[0][j] * v2[k][j]))
    return v3                                // the output of the DBN network

RBM_Alg(v, epsilon, W, b, c, l, gn)
    // l is the number of hidden units; b and c are the hidden and visible biases
    repeat gn times
        for all hidden units i:  h[i] = P(h[i] = 1 | v[k]) = sigm(b[i] + sum_j(W[i][j] * v[k][j]))
        for all visible units j: v'[k][j] = P(v[k][j] = 1 | h) = sigm(c[j] + sum_i(W[i][j] * h[i]))
        for all hidden units i:  h'[i] = P(h[i] = 1 | v'[k]) = sigm(b[i] + sum_j(W[i][j] * v'[k][j]))
        W += epsilon * (h * v[k]^T - h' * v'[k]^T)
        b += epsilon * (h - h')
        c += epsilon * (v[k] - v'[k])
    return W, b, c

Figure 1: The architecture of the DBN network used (a stack of three RBM layers).
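As referenced above, the following sketch shows the same greedy layer-wise pass in NumPy: each RBM is trained with CD-1 on the activations produced by the layer beneath it, and the single-unit top RBM yields one scalar output per object. It reuses the hypothetical RBM and cd1_update helpers from the earlier sketches; the per-vector training loop and the default layer sizes (NF/2, NF/4 and 1 for the four Iris features) are our reading of the pseudocode, not the authors' code.

import numpy as np

def train_dbn(X, layer_sizes=(2, 1, 1), epsilon=0.1, n_sweeps=1000, rng=None):
    # Greedy layer-wise training of a stack of RBMs on data X scaled to [0, 1].
    # layer_sizes lists the hidden-unit counts; the last entry is 1, so the DBN
    # produces a single value per input object. Reduce n_sweeps for a quick test.
    if rng is None:
        rng = np.random.default_rng(0)
    rbms, layer_input = [], np.asarray(X, dtype=float)
    for n_hidden in layer_sizes:
        rbm = RBM(layer_input.shape[1], n_hidden, rng=rng)
        for _ in range(n_sweeps):              # repeated CD-1 passes over the data
            for v in layer_input:
                cd1_update(rbm, v, epsilon, rng)
        # The hidden probabilities of this RBM become the visible layer of the next one.
        layer_input = np.array([rbm.hidden_probs(v) for v in layer_input])
        rbms.append(rbm)
    return rbms, layer_input.ravel()           # one scalar DBN output per object

The returned outputs can then be grouped into intervals and labelled, as described in the next section.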
4. Experimental Results and Discussion

The DBN classification approach has been applied to the well-known Iris dataset [10], consisting of 148 objects and 3 classes. The classification used 90% of the input for training and 10% for testing. The output of the DBN network falls into three distinct intervals (clusters) with the following ranges: cluster 1 [0, 0.264], cluster 2 (0.264, 0.782) and cluster 3 [0.782, 1]. From the class labels of the objects, classes 2, 1 and 0 correspond to clusters 1, 2 and 3 respectively. The remaining 10% of the dataset is then tested by passing it through the trained DBN and finding the range in which the DBN output of each object lies. The class of each object is determined from that range and compared with the class label associated with the object. The resulting classification accuracy on the 10% test portion of the dataset is 93.3%. The clustering result is shown in Figure 2.

Figure 2: DBN output of the single unit in the last hidden layer for the Iris dataset.

Classification of the Abalone dataset. Another dataset, the Abalone dataset, has also been tested (Figure 3). The output can be divided into two intervals, each containing two classes: the first is [0.16, 0.44] and includes the class 8 and class 9 objects, and the second is [0.5, 1] and includes the class 6 and class 7 objects. A prior step applied to this dataset selected 4 of the 7 features, and it was very effective: as seen in Figure 4, without feature selection all four classes appear concentrated in the same interval [0.2, 0.4].

Figure 3: DBN output of the single unit in the last hidden layer for the Abalone dataset after feature selection.

Figure 4: DBN output of the single unit in the last hidden layer for the Abalone dataset before feature selection.

The classification performance does not change when the number of RBM layers or the number of Gibbs sampling iterations (1000) is increased. Table 1 compares the performance of the proposed classifier with other classifiers from the Weka software [11].

Table 1: Comparison of accuracy with Weka classifiers.
    Dataset | DBN (80% training) | DBN (90% training) | Weka classification accuracy: BN | SVM | MLP | DT
    Iris    | (accuracy values not recoverable from the extracted text)
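The interval-based decision rule used for testing can be written as the small sketch below. The boundaries are the cluster ranges reported above for the Iris run (with the cluster 2 range inferred as the gap between clusters 1 and 3), and the cluster-to-class mapping follows the correspondence stated in the text; the function itself is an assumed helper, not the authors' code.

def predict_iris_class(dbn_output):
    # Assign a class label from the scalar DBN output using the reported ranges:
    # cluster 1 = [0, 0.264] -> class 2, cluster 2 = (0.264, 0.782) -> class 1,
    # cluster 3 = [0.782, 1] -> class 0.
    if dbn_output <= 0.264:
        return 2
    elif dbn_output < 0.782:
        return 1
    else:
        return 0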
5. Conclusion

The undirected deep architecture has provided good results in dimensionality reduction; this supports the idea of using DBNs for learning in artificial intelligence. For classification, two types of supervised training have been proposed for the DBN structure, BP-DBN and AM-DBN. Using a DBN without class labels, i.e. with unsupervised learning, can lead to dimensionality reduction. In this study we have shown that the unsupervised learning of a DBN can lead to clustering of the data and subsequently to its classification. The approach has been tested on two different, well-known datasets, Iris and Abalone. The continuous nature of these two datasets was a further challenge, which was handled by scaling them into the interval between 0 and 1.

References

[1] Y. Bengio, P. Lamblin, D. Popovici and H. Larochelle, "Greedy Layer-Wise Training of Deep Networks", Advances in Neural Information Processing Systems 19, MIT Press, Cambridge, MA.
[2] G. E. Hinton, "A fast learning algorithm for deep belief nets", Neural Computation, vol. 18(7), July 2006.
[3] H. Larochelle, Y. Bengio, J. Louradour and P. Lamblin, "Exploring Strategies for Training Deep Neural Networks", Journal of Machine Learning Research, vol. 10, 2009.
[4] H. Larochelle and Y. Bengio, "Classification using discriminative restricted Boltzmann machines", Proceedings of the 25th International Conference on Machine Learning, vol. 307, 2008.
[5] I. Sutskever and G. E. Hinton, "Learning multilevel distributed representations for high-dimensional sequences", Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007.
[6] A. K. Noulas and B. J. A. Krose, "Deep Belief Networks for Dimensionality Reduction", Belgian-Dutch Conference on Artificial Intelligence, Netherlands, 2008.
[7] I. Goodfellow, Q. Le, A. Saxe and A. Ng, "Measuring invariances in deep networks", Advances in Neural Information Processing Systems, vol. 22, 2009.
[8] A. R. Mohamed, G. Dahl and G. E. Hinton, "Deep belief networks for phone recognition", NIPS 22 Workshop on Deep Learning for Speech Recognition.
[9] H. Chen and A. Murray, "A continuous restricted Boltzmann machine with an implementable training algorithm", IEE Proceedings - Vision, Image and Signal Processing, vol. 150(3), 2003.
[10] UCI Machine Learning Repository.
[11] Weka: Data Mining Software in Java.
[12] L. McAfee, "Document Classification using Deep Belief Nets", CS224n, Spring.