Hierarchical Representation Using NMF


Hyun Ah Song 1 and Soo-Young Lee 1,2
1 Department of Electrical Engineering, KAIST, Daejeon, Republic of Korea
2 Department of Bio and Brain Engineering, KAIST, Daejeon, Republic of Korea
hi.hyunah@gmail.com, sylee@kaist.ac.kr
Note: Hyun Ah Song is now at KIST (Korea Institute of Science and Technology).
M. Lee et al. (Eds.): ICONIP 2013, Part I, LNCS 8226, pp. 466-473. Springer-Verlag Berlin Heidelberg 2013

Abstract. In this paper, we propose a representation model that performs hierarchical feature learning using nsNMF. We stack a simple unit algorithm into several layers to take a step-by-step approach to learning. By using NMF as the unit algorithm, the proposed network provides an intuitive understanding of the feature development process: it is able to represent the underlying structure of the feature hierarchies present in complex data in an intuitively understandable manner. Experiments with document data successfully discovered feature hierarchies of concepts in the data. We also observed that the proposed method yields much better classification and reconstruction performance, especially for small numbers of features.

Keywords: Hierarchical representation, NMF, unsupervised feature learning, multi-layer, deep learning.

1 Introduction

Humans are efficient learning machines. We can easily extract features from complex data and understand them. How do we do this? We use a hierarchical feature extraction strategy: by breaking a complex problem into several simple problems, we solve them one by one over multiple stages [2]. By integrating simple solutions throughout the layers, an algorithm can solve a complex problem without involving complex mathematical functions. The visual cortex supports this hierarchical information processing mechanism well [6]. Motivated by this biological evidence, researchers have been paying attention to hierarchical feature extraction approaches. One of the best-known algorithms is the Deep Belief Network (DBN) introduced in 2006 [5], where Hinton showed the first success in training deep architectures of Restricted Boltzmann Machines (RBMs) in a greedy layer-wise manner. With the success of training deep architectures, several variants of deep learning have been introduced: auto-encoders stacked into several layers [3, 8], and NMF stacked into several layers [1, 4, 10].

Although these multi-layered algorithms take a hierarchical approach to feature extraction and provide efficient solutions to complex problems, they do not provide intuitive relationships among features, in the form of hierarchies, learned throughout the hierarchical structure: most deep learning networks allow both addition and subtraction of features in the hierarchical learning process, which results in an intricate representation of the feature development process that is hard to follow intuitively.

In this paper, we propose a hierarchical data representation model, hierarchical multi-layer non-negative matrix factorization (an extended version of [11]). We extend a variant of the NMF algorithm [7], nsNMF [9], into several layers for hierarchical learning. By stacking an algorithm that enforces non-negativity, we allow only the addition of features in the process of developing feature hierarchies. We demonstrate the intuitive feature development process along the layers, and display the hierarchies present in the data set by learning relationships between features across the layers. We also show that, compared to one-step learning, the hierarchical approach learns more meaningful and helpful features, which leads to better distributed representations and to better classification and reconstruction performance for small numbers of features.

The organization of the paper is as follows. In Section 2, we introduce the unit algorithm of our hierarchical network, nsNMF. We then look into the structure of the proposed network in Section 3. We explain the intuitive understanding of our hierarchical feature extraction process in Section 4. We demonstrate the experimental results of the proposed network on the Reuters document data set in Section 5, and close the paper in Section 6.

2 Non-smooth Non-negative Matrix Factorization (nsNMF)

The proposed network is constructed by stacking nsNMF [9] into several layers. Non-smooth non-negative matrix factorization (nsNMF) is a variant of NMF that imposes a sparsity constraint. Basic NMF decomposes non-negative input data X into non-negative W and H, which are the features and the corresponding coefficients (data representation), respectively. It aims to reduce the error between the original data X and its reconstruction WH:

C = \frac{1}{2}\|X - WH\|^2 = \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{n}\Bigl(X_{ij} - \sum_{k=1}^{f} W_{ik}H_{kj}\Bigr)^2.

To apply the sparsity constraint to standard NMF, a smoothing matrix S is introduced in [9]:

S = (1 - \theta)\,I(k) + \frac{\theta}{k}\,\mathrm{ones}(k),

where k is the number of features and \theta, in the range 0 to 1, is the parameter controlling the smoothing effect. I(k) is the identity matrix of size k x k, and ones(k) is a k x k matrix whose components are all 1. We smooth a matrix by multiplying it with S; the closer \theta is to 1, the stronger the smoothing effect. During the alternating updates, we smooth the H matrix by multiplying S and H, i.e., H = SH, at each iteration. To compensate for the loss of sparsity caused by smoothing, W becomes sparse.
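To make the alternating procedure concrete, here is a minimal NumPy sketch of nsNMF with multiplicative updates and the smoothing matrix S inserted on each side; the initialization, iteration count, and epsilon guard are our own assumptions, not details taken from the paper or from [9].

```python
import numpy as np

def nsnmf(X, k, theta=0.5, n_iter=200, eps=1e-9):
    """Minimal non-smooth NMF sketch: X (m x n) ~ W (m x k) @ S @ H (k x n)."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, k))
    H = rng.random((k, n))
    # Smoothing matrix: S = (1 - theta) * I(k) + (theta / k) * ones(k)
    S = (1.0 - theta) * np.eye(k) + (theta / k) * np.ones((k, k))
    for _ in range(n_iter):
        Hs = S @ H                                   # smoothed H (H = SH)
        W *= (X @ Hs.T) / (W @ Hs @ Hs.T + eps)      # multiplicative update of W
        Ws = W @ S                                   # smoothed W for the H update
        H *= (Ws.T @ X) / (Ws.T @ Ws @ H + eps)      # multiplicative update of H
    return W, H
```

The same routine is reused below as the unit algorithm of each layer.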

3 Multi-layer Architecture

The proposed hierarchical multi-layer NMF structure consists of several layers of the unit algorithm. The structure is described in Fig. 1.

Fig. 1. Overall architecture of the hierarchical multi-layer NMF network.

We first train each layer separately. We process the outcome of each layer, H^(l), to obtain K^(l) as K^(l) = f(H^(l) / M^(l)), where the division is element-wise, M^(l) is a matrix of the mean values of H^(l) (defined below), f(·) is a nonlinear function, and the superscript denotes the layer index, l = 1, 2, ..., L. The processed data representation K^(l) is used as the input to the next layer: using nsNMF, K^(l) is decomposed into W^(l+1) and H^(l+1), i.e., K^(l) ≈ W^(l+1) H^(l+1). We repeat this process, extending the layers for l = 1, 2, ..., L.

After training layers 1 to L separately, we use the outcome of the separate training as initialization and train the whole network jointly. The cost function for joint training is

C = \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{n}\Bigl(X_{ij} - \sum_{k=1}^{f} W^{(1)}_{ik}\tilde{H}^{(1)}_{kj}\Bigr)^2,

where \tilde{H}^{(l)} is the reconstruction of H^(l), which can be computed via back-propagation of errors from the last layer to the l-th layer. The computation can be written in a simpler form, similar to [1], as in equation (1), where \otimes and the fractions denote element-wise multiplication and division:

W^{(l)} \leftarrow W^{(l)} \otimes \frac{Nu^{(l)}\,\tilde{H}^{(l)T}}{De^{(l)}\,\tilde{H}^{(l)T}}, \qquad H^{(l)} \leftarrow H^{(l)} \otimes \frac{W^{(l)T}\,Nu^{(l)}}{W^{(l)T}\,De^{(l)}},   (1a)

Nu^{(l)} = \begin{cases} X & \text{if } l = 1 \\ \bigl(W^{(l-1)T}\,Nu^{(l-1)}\bigr) \otimes \bigl(M^{(l-1)} \otimes f^{-1}(W^{(l)}H^{(l)})\bigr) & \text{otherwise,} \end{cases}   (1b)

De^{(l)} = \begin{cases} \tilde{X} & \text{if } l = 1 \\ \bigl(W^{(l-1)T}\,De^{(l-1)}\bigr) \otimes \bigl(M^{(l-1)} \otimes f^{-1}(W^{(l)}H^{(l)})\bigr) & \text{otherwise.} \end{cases}   (1c)

Here, \tilde{X} = W^{(1)}\tilde{H}^{(1)}, and \tilde{H}^{(l)} can be computed as shown in (2):

\tilde{H}^{(l)} = \begin{cases} H^{(l)} & \text{if } l = L \\ M^{(l)} \otimes f^{-1}\bigl(W^{(l+1)}\tilde{H}^{(l+1)}\bigr) & \text{if } l = L-1, \ldots, 1. \end{cases}   (2)

M^(l) is a matrix of the column-wise means of H^(l), and f^{-1}(·) is the inverse of the nonlinear function. After training up to the last layer, the final data representation H^(L) is acquired. This is the activation information of the complex features, i.e., the integration of features throughout the layers, W^(1) W^(2) ... W^(L).
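The greedy layer-wise stage can be sketched in a few lines by stacking the nsnmf routine above. The logistic choice for f, the per-feature mean used for M^(l), and the example layer sizes are our assumptions for illustration; the joint fine-tuning of equations (1)-(2) is omitted.

```python
import numpy as np

def f(Z):
    """Assumed nonlinearity (logistic); the exact choice of f is not pinned down here."""
    return 1.0 / (1.0 + np.exp(-Z))

def layerwise_pretrain(X, layer_sizes, theta=0.5, n_iter=200):
    """Greedy layer-wise training sketch of the hierarchical multi-layer NMF.

    X           : non-negative data matrix (m x n)
    layer_sizes : number of features per layer, e.g. [160, 20]
    Returns the per-layer factor lists Ws, Hs.
    """
    Ws, Hs = [], []
    K = X
    for k in layer_sizes:
        W, H = nsnmf(K, k, theta=theta, n_iter=n_iter)   # unit algorithm from the previous sketch
        Ws.append(W)
        Hs.append(H)
        M = H.mean(axis=1, keepdims=True) + 1e-9         # assumed form of M^(l): mean activation per feature
        K = f(H / M)                                     # K^(l) = f(H^(l) / M^(l)) feeds the next layer
    return Ws, Hs
```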

4 Intuition of Hierarchical NMF Feature Learning in Images

In this section, we provide an intuitive understanding of the hierarchical feature learning of the proposed network. The hierarchical feature learning shows what kind of features develop at each layer, and how features from the lower layer are combined to form higher-layer features. Other deep learning algorithms also learn hierarchical features that are combinations of lower-layer features. However, since they do not restrict the values to be non-negative, the combination of features involves subtraction as well; this makes the feature development process hard to follow, and the representation of the features is not intuitively understandable. In contrast, hierarchical NMF provides an intuitive understanding of feature hierarchies by allowing only weighted summations of features during the hierarchical learning process. This can be interpreted as simply adding Lego blocks to construct a complete structure.

To aid understanding, we demonstrate the feature learning process using image data, the MNIST digit data (footnote 1). In Fig. 2, the construction of the data from learned features is described as a feature hierarchy. The first layer learns very simple spot-like features W^(1), which can be seen as basic building blocks. In the second layer, W^(2), combinations of these first-layer features are learned in a distributed pattern. Integration through the two layers produces complex blocks by combining W^(1) according to W^(2). We can intuitively follow the process of building up features in a weighted-summation manner, thanks to the non-negativity constraint, which allows only simple add-ons. The combination of complex features is in turn combined to represent an original data sample. As this demonstration shows, the proposed hierarchical network learns features as the building blocks of the data and provides an intuitive hierarchical process, discovering the feature hierarchies present in complex data.

Fig. 2. Feature hierarchy of the MNIST database. Images in order of W^(1), W^(1)W^(2), X, from bottom to top.

5 Experiment with Document Data

We applied the proposed network to a document database. We used the Reuters-21578, Distribution 1.0 collection as input data (footnote 2). We selected the top 10 categories from the ModApte split and reduced the dimensionality to 1000 by removing stop-words.

Footnote 1: Available at:
Footnote 2: The Reuters-21578, Distribution 1.0 test collection is available from David D. Lewis' professional home page, currently:
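As a minimal sketch of this preprocessing, the snippet below builds the 1000-term count matrix with scikit-learn; load_reuters_documents is a hypothetical helper standing in for whatever loader is used for the ModApte split, and since the paper does not specify its term weighting, raw counts are assumed.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical helper: returns raw texts and labels of the top-10 ModApte categories.
docs, labels = load_reuters_documents()

# Keep the 1000 most frequent terms after English stop-word removal.
vectorizer = CountVectorizer(stop_words="english", max_features=1000)
X = vectorizer.fit_transform(docs).T.toarray().astype(float)   # term x document, non-negative

print(X.shape)   # (1000, number_of_documents)
```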

There are 5786 training samples and 2587 test samples. We constructed a two-layered network with the numbers of hidden neurons set to 160 and a, where a denotes the dimension of the final data representation.

5.1 Feature Hierarchies

In Fig. 3, an example of the top 10 words of the high-level features learned via the integration of two layers, W^(1)W^(2), is displayed. Simple observation reveals which topic each feature represents: grain, money, crude, interest, coffee, trade, earn, acq, grain, and ship.

Fig. 3. An example of W^(1)W^(2).
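To reproduce the kind of inspection shown in Fig. 3, the sketch below lists the top 10 vocabulary terms of each integrated feature W^(1)W^(2); it assumes the Ws list from the layer-wise sketch and the fitted vectorizer from the preprocessing sketch, and is only meant to show the bookkeeping, not the authors' exact procedure.

```python
import numpy as np

vocab = np.array(vectorizer.get_feature_names_out())   # vocabulary from the preprocessing sketch
W_integrated = Ws[0] @ Ws[1]                            # term x high-level-feature matrix W^(1) W^(2)

for j in range(W_integrated.shape[1]):
    top = np.argsort(W_integrated[:, j])[::-1][:10]     # indices of the 10 largest weights in column j
    print(f"feature {j}: {', '.join(vocab[top])}")
```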

As in Fig. 2, the features shown in Fig. 3 are part of a hierarchy of concepts in Reuters. An example of how concepts form hierarchies in Reuters is shown in Fig. 4. In Fig. 4(a), three first-layer features W^(1) are weighted and summed to form a second-layer feature W^(2). By observing the words in each feature, we see that the lower-layer features each cover a small scope of the topic and contain various words. However, as they proceed to the higher layer, they converge to represent one broad common concept, oil, with their top four words being synonyms of oil. Based on Fig. 4(a), we can construct a concept hierarchy in Reuters as shown in Fig. 4(b). From the hierarchical concept diagram, we observe that the broad topic oil (words indicated in red) contains various other oil-related words (colored in blue) at the low level. We can also observe how some words are extracted together to comprise a feature, showing their relations with each other in the first layer: the words exploration, ecuador, pipeline, well, and saudi form a distinct group, and the same holds for texas, increase, purchase, contract, effective, and for mln, stocks, fell, rose, reserves, refinery. This grouping information can be interpreted as sub-topics under the topic oil. If we used a single-layered network, all we could observe would be the red words that indicate the topic oil. With the hierarchical representation, however, we can look deeper into the data structure, since the contents of the blue words and their groupings at the low level are exposed.

Fig. 4. Concept hierarchy in Reuters. (a) Experimental results, and (b) diagram of the result in (a).

5.2 Classification and Reconstruction Property

The classification and reconstruction performance for various a (the dimension of H^(L)) is shown in Fig. 5 (a) and (b), respectively. For the classifier, we used an SVM. We calculated the reconstruction error as

Mean reconstruction error = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\bigl|X_{ij} - \tilde{X}_{ij}\bigr|.

The proposed hierarchical feature extraction method results in much better classification and reconstruction, especially for small numbers of features, compared to a standard network that consists of only one layer. Even at a dimension of 20, the proposed network already reaches the maximum performance it attains after convergence. This supports the view that taking a step-by-step approach, learning features in a hierarchical manner, provides a better chance of learning meaningful features from complex data: the first layer pre-processes the complex data by breaking it down into small units, lessening the burden on the second layer, which then only needs to learn how to combine these units.
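A minimal sketch of the two evaluation measures follows; the particular SVM (scikit-learn's LinearSVC with default settings) is our assumption, since the paper only states that an SVM was used.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def mean_reconstruction_error(X, X_rec):
    """Mean absolute element-wise error between the data and its reconstruction."""
    m, n = X.shape
    return np.abs(X - X_rec).sum() / (m * n)

def classification_accuracy(H_train, y_train, H_test, y_test):
    """Train an SVM on the final representation H^(L); samples are columns, hence the transposes."""
    clf = LinearSVC()   # assumed SVM variant
    clf.fit(H_train.T, y_train)
    return accuracy_score(y_test, clf.predict(H_test.T))
```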

Fig. 5. (a) Reuters data classification, and (b) Reuters data reconstruction.

In Fig. 6, we show two examples of reconstruction; the same word is marked with the same color. In the first example, the reconstruction via the proposed network (c) recovers most of the significant words present in the original data (a). It also successfully learned the importance of the words, displaying a word order similar to the original. In contrast, the standard network (b) misses key words and fails to capture the importance of the words, showing them in a mixed order compared to the original. The second example shows a similar result: in (e), the single-layer feature confuses the subject of the topic by containing the words wheat, oil, corn, and sugar altogether, whereas the proposed network (f) correctly extracts key words related to the subject oil.

Fig. 6. Two examples of original data and their reconstructions by the standard (single-layered) and proposed networks, shown in (a), (b), (c) and (d), (e), (f), respectively.

6 Conclusion

In this paper, we proposed a hierarchical representation model based on nsNMF that takes a step-by-step approach to learning the features of complex data. The proposed network discovers the feature hierarchies present in complex data and demonstrates them in an intuitively understandable manner by learning feature relationships across the layers in a non-negative fashion. Through simple addition and accumulation of features, we are able to understand the data structure and construct a hierarchy based on the information learned by the network. We also showed that the proposed network provides better classification and reconstruction performance than a single-layered network when only a small number of dimensions is provided for the final data representation. As future work, we would like to apply the proposed network to various types of data to discover the underlying feature hierarchies in complex data.

Acknowledgments. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT and Future Planning.

References

1. Ahn, J.-H., Choi, S., Oh, J.-H.: A multiplicative up-propagation algorithm. In: Proceedings of the Twenty-First International Conference on Machine Learning, p. 3. ACM (2004)
2. Bengio, Y.: Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1) (2009)
3. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems 19, 153 (2007)
4. Cichocki, A., Zdunek, R.: Multilayer nonnegative matrix factorisation. Electronics Letters 42(16) (2006)
5. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Computation 18(7) (2006)
6. Hubel, D.H., Wiesel, T.N.: Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology 160(1), 106 (1962)
7. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755) (1999)
8. Ranzato, M., Poultney, C., Chopra, S., LeCun, Y.: Efficient learning of sparse representations with an energy-based model. Advances in Neural Information Processing Systems 19 (2007)
9. Pascual-Montano, A., Carazo, J.M., Kochi, K., Lehmann, D., Pascual-Marqui, R.D.: Nonsmooth nonnegative matrix factorization (nsNMF). IEEE Transactions on Pattern Analysis and Machine Intelligence 28(3) (2006)
10. Rebhan, S., Eggert, J.P., Gross, H.-M., Körner, E.: Sparse and transformation-invariant hierarchical NMF. In: de Sá, J.M., Alexandre, L.A., Duch, W., Mandic, D.P. (eds.) ICANN 2007. LNCS, vol. 4668. Springer, Heidelberg (2007)
11. Song, H.A., Lee, S.Y.: Hierarchical data representation model - multi-layer NMF. arXiv preprint (2013)
