Graphite: Iterative Generative Modeling of Graphs


Aditya Grover*, Aaron Zweig*, Stefano Ermon (*Equal contribution)

Second workshop on Bayesian Deep Learning (NIPS 2017), Long Beach, CA, USA.

Abstract

Graphs are a fundamental abstraction for modeling relational data. However, graphs are discrete and combinatorial in nature, and learning representations suitable for machine learning tasks poses statistical and computational challenges. In this work, we propose Graphite, an algorithmic framework for unsupervised learning of representations over nodes in a graph using deep latent variable generative models. Our model is based on variational autoencoders (VAE) and differs from existing VAE frameworks for data modalities such as images, speech, and text in its use of spectral graph convolutions for parameterizing both the generative model (a.k.a. decoder) and the inference model (a.k.a. encoder). The use of graph convolutions directly incorporates inductive biases specific to graphs into the generative model, such as permutation invariance to node orderings and a locality preference for clustering node representations. We demonstrate empirically that Graphite outperforms state-of-the-art approaches for representation learning over graphs on the task of link prediction on benchmark datasets.

1 Introduction

Latent variable generative modeling is an effective approach for unsupervised representation learning of high-dimensional data [11]. In recent years, representations learned by latent variable models parameterized by deep neural networks have shown impressive performance on many tasks such as semi-supervised learning and structured prediction [6, 16]. However, these successes have been restricted to specific data modalities such as images and speech. In particular, current deep generative models cannot be directly applied to graph-structured data. Graphs arise in a wide variety of domains in the physical, information, and social sciences, and consequently algorithms for unsupervised representation learning of such data have several downstream applications, including node classification, link prediction, and community detection [3, 10, 4].

In recent work, the notion of spectral graph convolutions has been the guiding principle for designing neural network architectures that can directly operate on graphs [1, 2]. A spectral graph convolution is defined as the multiplication of a signal (i.e., a feature matrix associated with the nodes) with a parameterized filter, in the Fourier space of a graph. Graph convolutional networks (GCN), in particular, efficiently compute local first-order approximations to spectral graph convolutions, and have been successfully applied across several graph mining tasks such as semi-supervised learning and relational learning [9]. Such tasks involve encoding an input graph into a final output representation (such as the labels associated with the nodes), possibly through intermediate layers. The inverse problem of learning to decode a hidden representation into a graph, as required by a latent variable generative model, is to the best of our knowledge largely an open question that we address in this work.

We propose Graphite, a framework for iteratively learning latent variable generative models of graphs. Specifically, we learn a directed model expressing a joint distribution P_θ(A, Z) over the adjacency matrix of a graph A ∈ {0,1}^{n×n} and a latent feature matrix Z ∈ R^{n×k} such that every row corresponds to a feature vector for a node in the graph.

The multi-layer iterative decoding process for specifying the conditional distribution P_θ(A | Z) first constructs an intermediate graph using an inner-product operation, and then applies a sequence of graph convolutional layers to this intermediate graph. We learn the model parameters θ by maximizing a variational lower bound on the log-likelihood assigned by the model to the observed graph. Our empirical evaluations show that Graphite outperforms competing algorithms on the task of link prediction on benchmark datasets.

2 Learning framework

We first define some notation and give a brief background on spectral graph convolutions before presenting Graphite. Consider an undirected graph G = (V, E) where V and E denote the set of nodes and edges respectively. We represent the graph using an adjacency matrix A ∈ {0,1}^{n×n} where n = |V| and the diagonal entries A_ii = 1. Additionally, we denote the feature matrix associated with the graph as X ∈ R^{n×m} for an m-dimensional signal associated with each node in the graph. If there are no explicit node features, we set X = I_n (identity). Let D be the diagonal degree matrix such that D_ii = Σ_{(i,j)∈E} A_ij, and hence the graph Laplacian is L = D - A. We define Ã to be the symmetric normalization of A, given by Ã = D^{-1/2} A D^{-1/2}.

The Fourier basis of a graph is given by the eigenvector matrix of its graph Laplacian. A spectral graph convolution convolves the feature matrix X with a parameterized filter F_θ = diag(θ) (where θ ∈ R^n are the learned parameters), defined as a matrix multiplication in the Fourier basis of the graph. Formally, we define the graph convolution operation between F_θ and X for the graph A as:

F_θ ⋆ X = U F_θ U^T X

where U is the eigenvector matrix of the graph Laplacian, i.e., L = U Λ U^T with Λ the diagonal matrix of eigenvalues. Since the eigendecomposition of the graph Laplacian is computationally prohibitive, [9] propose a local first-order approximation:

F_θ ⋆ X ≈ D^{-1/2} A D^{-1/2} X.

Extending the above approximation to a neural network design with several filters applied in each layer, we get the layerwise propagation rule of a graph convolutional network (GCN) for l ≥ 1:

H^{(l)} = η(D^{-1/2} A D^{-1/2} H^{(l-1)} Θ^{(l)})

where H^{(l)} and H^{(l-1)} are the l-th and (l-1)-th layers of the GCN with the base case H^{(0)} = X, Θ^{(l)} is a matrix of learnable parameters for the current layer, and η is a suitable activation function. For brevity, throughout the paper we will denote the propagation rule as:

H^{(l)} = GCN_{η,l}(A, H^{(l-1)}).

An alternate interpretation of a GCN layer is as message passing over a graph in the spirit of the Weisfeiler-Lehman (WL) algorithm [18]. Intuitively, every node in the graph passes messages, or information about its local structure, to its neighbors. For a single GCN layer, the messages are passed to neighbors at a 1-hop distance from the node. In a multi-layer GCN with L layers, information is propagated to neighbors up to an L-hop distance from the source. Refer to [9] for more details.

2.1 Graphite Variational Autoencoder

We use upper-case symbols to denote probability distributions and assume they all admit absolutely continuous densities (denoted by the corresponding lower-case notation) on a suitable reference measure. We are interested in learning a latent variable model that specifies a joint distribution P_θ(A, Z | X) conditioned on the feature matrix X of a graph A, with a latent variable matrix Z ∈ R^{n×k} where every row corresponds to a latent vector for a node in the graph.
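Both the encoder and the decoder described below are built from the GCN propagation rule above. As a concrete reference, here is a minimal NumPy sketch of a single propagation step. This is our illustration rather than the authors' code; function names such as gcn_layer are ours, and the parameters are random rather than learned:

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; assumes A already
    contains self-loops (A_ii = 1), as in the paper's convention."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_norm, H, Theta, eta=np.tanh):
    """One GCN step: H^{(l)} = eta(D^{-1/2} A D^{-1/2} H^{(l-1)} Theta^{(l)})."""
    return eta(A_norm @ H @ Theta)

# Toy usage on a 4-node path graph with identity features (X = I_n).
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
X = np.eye(4)
Theta1 = np.random.randn(4, 8)   # learnable parameters (random here)
H1 = gcn_layer(normalize_adjacency(A), X, Theta1)
```

Stacking such layers gives the multi-layer GCNs used for the encoder and decoder below.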
Similar to a variational autoencoder [7], our learning objective maximizes an evidence lower bound (ELBO) on the log-likelihood of the observed graph A:

log p_θ(A | X) ≥ E_{q_φ(Z | A, X)} [ log ( p_θ(A, Z | X) / q_φ(Z | A, X) ) ]

where q_φ(Z | A, X) is a variational approximation to the true posterior p_θ(Z | A, X), parameterized by φ. We specify an isotropic standard Gaussian prior over Z such that the joint distribution factorizes as P_θ(A, Z | X) = P_θ(A | Z, X) P(Z). The observation model P_θ(A | Z, X) and the variational posterior Q_φ(Z | A, X), also referred to as the decoder and the encoder respectively, are specified using deep neural networks. In particular, we use graph convolutional networks, as described below.
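For concreteness, a minimal NumPy sketch of this objective (our illustration, not the authors' implementation) combines a Bernoulli reconstruction term over all entries of A with the closed-form KL divergence between the diagonal-Gaussian posterior and the standard-normal prior:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elbo(A, logits, mu_q, log_sigma_q):
    """ELBO for a VGAE-style graph model (illustrative names).

    A:                  (n, n) binary adjacency matrix
    logits:             (n, n) decoder logits for edge probabilities
    mu_q, log_sigma_q:  (n, k) parameters of q_phi(Z | A, X)
    """
    # Bernoulli log-likelihood of the observed graph, summed over entries.
    p = sigmoid(logits)
    recon = np.sum(A * np.log(p + 1e-10) + (1 - A) * np.log(1 - p + 1e-10))
    # KL(q || N(0, I)) for a factorized Gaussian, in closed form.
    kl = -0.5 * np.sum(1 + 2 * log_sigma_q - mu_q**2 - np.exp(2 * log_sigma_q))
    return recon - kl
```

In training, Z is drawn from the posterior via the usual reparameterization Z = μ_q + σ_q ⊙ ε with ε ~ N(0, I), so the bound can be optimized by stochastic gradient ascent.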

Table 1: Area Under the ROC Curve (AUC) scores for link prediction (* denotes dataset with features)

Method          PPI     Cora    Citeseer    Pubmed    Cora*    Citeseer*    Pubmed*
SC              84.2
DW              68.2
GAE             88.8
VGAE            89.5
Graphite-AE     91.1
Graphite-VAE    91.2

Table 2: Average Precision (AP) scores for link prediction (* denotes dataset with features)

Method          PPI     Cora    Citeseer    Pubmed    Cora*    Citeseer*    Pubmed*
SC              88.9
DW              69.0
GAE             89.4
VGAE            89.6
Graphite-AE     92.1
Graphite-VAE    92.2

Encoder. The encoder Q_φ(Z | A, X) is a factorized multivariate Gaussian distribution specified using a multi-layer GCN, referred to as the EGCN (Encoder-GCN). In our experiments, we use two layers for the encoder and a multivariate Gaussian posterior with diagonal covariance. Hence, the forward pass is given by:

H_e^{(1)} = EGCN_{η1,1}(A, X)
μ_q, log σ_q = EGCN_{η2,2}(A, H_e^{(1)})

where η1 is any activation function and η2 is set to the identity (no activation) to search the full range of possible parameters for the Gaussian posterior.

Decoder. The decoder P_θ(A | Z, X) is a factorized Bernoulli distribution, also specified using a multi-layer GCN, referred to as the DGCN (Decoder-GCN). However, unlike the EGCN, which can directly access A, we do not have access to an explicit graph over which to perform graph convolutions. We resolve this shortcoming by generating an intermediate graph using a similarity matrix that is a function of the latent feature matrix Z ~ Q_φ(Z | A, X). The forward pass for the two-layer decoder is specified as:

Â = σ(Z Z^T)
H_d^{(1)} = DGCN_{η1,1}(Â, [Z | X])
Z* = DGCN_{η2,2}(Â, H_d^{(1)})

where η1 is any suitable activation function, η2 is the identity (such that Z and Z* can take the same range of values), and the vertical bars denote concatenation of the original feature matrix X with Z. For the final step of the decoding, we obtain the desired conditional distribution p_θ(A | Z, X) via an inner product over a feature matrix specified as the convex combination of Z and Z*:

Z̄ = λ Z + (1 - λ) Z*
p_θ(A | Z, X) = Π_{i=1}^n Π_{j=1}^n p_θ(A_ij | Z, X), where p_θ(A_ij = 1 | Z, X) = σ(Z̄_i · Z̄_j).

Here, Z̄_i denotes the i-th row of Z̄ and λ ∈ [0, 1] is a tunable hyperparameter.

2.2 Graphite Autoencoder

Mirroring the differences between a standard autoencoder (AE) and a variational autoencoder, the EGCN here directly outputs Z (instead of a variational posterior). Additionally, the learning objective minimizes the cross-entropy between the reconstructed edge probabilities and the true graph.
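The following NumPy sketch shows one reading of this iterative decoding step. It is our illustration under stated assumptions: parameter names are ours, the intermediate graph Â is symmetrically normalized before convolution (mirroring the encoder's propagation rule), and Theta2 maps back to k dimensions so that the convex combination with Z is well defined:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graphite_decode(Z, X, Theta1, Theta2, lam=0.5, eta=np.tanh):
    """Sketch of the iterative Graphite decoder.

    Z:      (n, k) latent feature matrix sampled from the encoder
    X:      (n, m) node feature matrix
    Theta1: (k + m, h) first-layer parameters
    Theta2: (h, k) second-layer parameters (identity activation)
    Returns an (n, n) matrix of edge probabilities p(A_ij = 1).
    """
    # Intermediate graph from latent similarities: A_hat = sigma(Z Z^T).
    A_hat = sigmoid(Z @ Z.T)
    # Symmetric normalization of the dense intermediate graph.
    d = A_hat.sum(axis=1)
    A_norm = A_hat / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
    # Two graph-convolutional layers on the intermediate graph,
    # starting from the concatenation [Z | X].
    H1 = eta(A_norm @ np.concatenate([Z, X], axis=1) @ Theta1)
    Z_star = A_norm @ H1 @ Theta2
    # Convex combination of input latents and refined latents.
    Z_bar = lam * Z + (1 - lam) * Z_star
    return sigmoid(Z_bar @ Z_bar.T)
```

Setting lam = 1 bypasses the graph-convolutional refinement and recovers a plain inner-product decoder, which is the GAE/VGAE special case noted in Section 3.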

Figure 1: t-SNE embeddings of the latent feature vectors for the Cora dataset. Colors denote labels. (a) Graphite-AE; (b) Graphite-VAE.

3 Experimental Evaluation

We consider the task of link prediction [10]. Even though Graphite learns a distribution over graphs, it can be used for predictive tasks within a single graph. In the case of link prediction, we learn a model on a random, connected training subgraph of the true graph. For validation and testing, we add positive and negative (false) edges to the original graph and evaluate the performance of the model based on the reconstruction probabilities assigned to the validation and test edges (similar to denoising of the input graph). In our experiments, we held out 5% of the edges for validation and 10% for testing, and trained all models on the remaining subgraph. The validation and test sets also each contain an equal number of non-edges. We evaluate performance using the Area Under the ROC Curve (AUC) and Average Precision (AP) metrics.

We compared across four standard benchmark datasets: a Protein-Protein Interaction (PPI) network for Homo sapiens [17, 5], with proteins as nodes and interactions as edges, and three citation networks, Cora, Citeseer, and Pubmed, with papers as nodes and citations as edges [15]. For the citation networks, the text of the papers can be synthesized into optional node features. For Graphite-AE and Graphite-VAE, we used the two-layer encoder and decoder architectures described above, trained using the Adam optimizer. The dropout rate (for edges) and λ were tuned as hyperparameters on the validation set to optimize the AUC. Additionally, we trained every model for 500 iterations and used the model checkpoint with the best validation loss for testing.

We evaluated Graphite-VAE and Graphite-AE against competing methods for representation learning on graphs. The baselines we consider are Spectral Clustering (SC) [1], DeepWalk (DW) [14], the Graph Autoencoder (GAE) [8], and the Variational Graph Autoencoder (VGAE) [8]. We used the SC implementation from [13] and the public implementations of the other baselines made available by their authors. For SC, we used an embedding dimension of 128. For DW, which uses a skip-gram-like objective on random walks over the graph, we used the same embedding dimension and the default settings from the paper: 10 random walks of length 80 per node and a context size of 10. For GAE and VGAE, we used the same architecture as VGAE with the Adam optimizer. Note that SC and DW do not provide the ability to incorporate node features while learning embeddings, and hence we evaluate them only on the featureless datasets. GAE and VGAE are a special case of Graphite that uses only a single inner-product decoder (i.e., λ = 1).

The AUC and AP results (along with standard errors), averaged over 50 random train/validation/test splits, are shown in Table 1 and Table 2 respectively. On both metrics, Graphite-VAE gives the best overall performance. Graphite-AE also gives good results, generally outperforming its closest competitor, GAE. We further visualize a 2D t-SNE projection [12] of the latent feature vectors (given as the rows of Z̄ with λ = 0.5) for the Cora dataset in Figure 1. Even without any access to label information for the nodes during training, the Graphite models are able to cluster the nodes (papers) according to their labels (paper categories).
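The evaluation protocol reduces to scoring held-out node pairs by their reconstruction probabilities. A short sketch of how such scoring could be computed with scikit-learn's standard metrics (our illustration; the helper name and argument layout are ours):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def score_link_prediction(prob, test_edges, test_non_edges):
    """Compute AUC and AP from reconstruction probabilities.

    prob:           (n, n) matrix of predicted edge probabilities,
                    e.g. the output of the decoder
    test_edges:     list of (i, j) held-out true edges (positives)
    test_non_edges: list of (i, j) sampled non-edges (negatives)
    """
    y_true = np.concatenate([np.ones(len(test_edges)),
                             np.zeros(len(test_non_edges))])
    y_score = np.array([prob[i, j] for i, j in test_edges + test_non_edges])
    return roc_auc_score(y_true, y_score), average_precision_score(y_true, y_score)
```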

References

[1] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. In ICLR, 2014.
[2] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016.
[3] D. Easley and J. Kleinberg. Networks, crowds, and markets: Reasoning about a highly connected world. Cambridge University Press, 2010.
[4] S. Fortunato. Community detection in graphs. Physics Reports, 486(3):75-174, 2010.
[5] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In KDD, 2016.
[6] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.
[7] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[8] T. N. Kipf and M. Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
[9] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
[10] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the Association for Information Science and Technology, 58(7):1019-1031, 2007.
[11] J. C. Loehlin. Latent variable models: An introduction to factor, path, and structural analysis. Lawrence Erlbaum Associates Publishers.
[12] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579-2605, 2008.
[13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830, 2011.
[14] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In KDD, 2014.
[15] P. Sen, G. M. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93-106, 2008.
[16] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In NIPS, 2015.
[17] C. Stark, B. J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Tyers. BioGRID: A general repository for interaction datasets. Nucleic Acids Research, 34:D535-D539, 2006.
[18] B. Weisfeiler and A. Lehman. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2(9):12-16, 1968.
