Latent Variable Models for the Analysis, Visualization and Prediction of Network and Nodal Attribute Data. Isabella Gollini.


Latent Variable Models for the Analysis, Visualization and Prediction of Network and Nodal Attribute Data
Isabella Gollini, School of Engineering, University of Bristol, isabella.gollini@bristol.ac.uk, January 4th. Joint work with Prof. Brendan Murphy (University College Dublin).

About Me
With Jonty: probabilistic methods for uncertainty assessment and quantification in natural hazards (floods, volcanoes, earthquakes, etc.).
Models to cluster binary data with complex dependence structure: Gollini, I., and Murphy, T. B., Mixture of Latent Trait Analyzers for Model-Based Clustering of Categorical Data, Statistics and Computing.
Models for network data, using variational methods for fast approximate inference.

Outline
Notation; network data; latent space models (LSM) for networks; variational inference; factor analysis for nodal attributes; a joint model for network and nodal attributes; latent variable models for multiple networks.

Notation
N: number of nodes of an observed network.
Y (N x N): adjacency matrix, with y_{ij} = 1 if there is an edge between nodes i and j and y_{ij} = 0 otherwise.
X (N x M): matrix of M nodal attributes.
z_n ~ N(0, \sigma^2 I): D-dimensional continuous latent variable.

Latent Space Model (LSM) for Networks
Hoff et al. (2002) introduced a model that assumes that each node n has an unknown position z_n in a D-dimensional Euclidean latent space. The probability of a link between two nodes depends on their distance d_{ij} in that latent space:

p(Y | Z, \alpha) = \prod_{i \neq j} p(y_{ij} | z_i, z_j, \alpha) = \prod_{i \neq j} \frac{\exp(\alpha - d_{ij})^{y_{ij}}}{1 + \exp(\alpha - d_{ij})},

so that p(y_{ij} = 1) = \frac{\exp(\alpha - d_{ij})}{1 + \exp(\alpha - d_{ij})},

with p(\alpha) = N(\xi, \psi^2), p(z_n) iid N(0, \sigma^2 I_D), and \sigma^2, \xi, \psi^2 fixed parameters. The posterior distribution cannot be calculated analytically.

NOTE: Hoff et al. (2002) use the Euclidean distance d_{ij} = \|z_i - z_j\|; we propose to use the squared Euclidean distance d_{ij} = \|z_i - z_j\|^2.

Why the Squared Euclidean Distance?
It requires fewer approximations in the estimation procedure, and it allows potential clusters to be visualized more clearly, giving a higher probability of a link between two nodes that are close in the latent space and a lower probability for two nodes lying far away from each other.

[Figure: link probability p(y_{ij} = 1) as a function of the latent distance, comparing d_{ij} = \|z_i - z_j\| with d_{ij} = \|z_i - z_j\|^2.]
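To make the link function concrete, here is a minimal base-R sketch (illustrative names only, not the lvm4net implementation) that computes the LSM link probabilities with the squared Euclidean distance for simulated latent positions and draws an adjacency matrix from them.

```r
## Minimal sketch, base R: LSM link probabilities with squared Euclidean distance.
## Illustrative only; this is not the lvm4net implementation.
set.seed(1)
N <- 10; D <- 2
Z     <- matrix(rnorm(N * D), nrow = N)     # latent positions z_n in R^D
alpha <- 1.5                                # intercept

d2 <- as.matrix(dist(Z))^2                  # squared Euclidean distances ||z_i - z_j||^2
P  <- plogis(alpha - d2)                    # p(y_ij = 1) = exp(a - d2) / (1 + exp(a - d2))
diag(P) <- 0                                # no self-loops

Y <- matrix(rbinom(N * N, 1, P), nrow = N)  # simulate a (directed) adjacency matrix
diag(Y) <- 0
```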

LSM for Networks: Variational Approach
We fit the model using a variational inference approach that is considerably quicker, but less accurate, than MCMC. The posterior distribution of the unknowns (Z, \alpha) is

p(Z, \alpha | Y) = \frac{1}{C} \, p(Y | Z, \alpha) \, p(\alpha) \prod_{n=1}^{N} p(z_n),

where C is the unknown normalising constant. We propose a variational posterior q(Z, \alpha | Y), introducing variational parameters \tilde{\xi}, \tilde{\psi}^2, \tilde{z}_n, \tilde{\Sigma}:

q(Z, \alpha | Y) = q(\alpha) \prod_{n=1}^{N} q(z_n), where q(\alpha) = N(\tilde{\xi}, \tilde{\psi}^2) and q(z_n) = N(\tilde{z}_n, \tilde{\Sigma}).

Variational Approach
The basic idea behind the variational approach is to find a lower bound on the log marginal likelihood log p(Y) by introducing the variational posterior distribution q(Z, \alpha | Y). This amounts to minimizing the Kullback-Leibler divergence between the variational posterior q(Z, \alpha | Y) and the true posterior p(Z, \alpha | Y):

KL[q(Z, \alpha | Y) \| p(Z, \alpha | Y)]
= \int q(Z, \alpha | Y) \log \frac{q(Z, \alpha | Y)}{p(Z, \alpha | Y)} \, d(Z, \alpha)
= \int q(Z, \alpha | Y) \log \frac{p(Y) \, q(Z, \alpha | Y)}{p(Y, Z, \alpha)} \, d(Z, \alpha)
= \log p(Y) - \int q(Z, \alpha | Y) \log \frac{p(Y, Z, \alpha)}{q(Z, \alpha | Y)} \, d(Z, \alpha).

The KL divergence can be written as

KL[q(Z, \alpha | Y) \| p(Z, \alpha | Y)] = KL[q(\alpha) \| p(\alpha)] + \sum_{i=1}^{N} KL[q(z_i) \| p(z_i)] - E_{q(Z, \alpha | Y)}[\log p(Y | Z, \alpha)] + \log p(Y),

and E_{q(Z, \alpha | Y)}[\log p(Y | Z, \alpha)] is bounded using Jensen's inequality:

E_{q}[\log p(Y | Z, \alpha)] = \sum_{i \neq j} \big( y_{ij} E_{q}[\alpha - \|z_i - z_j\|^2] - E_{q}[\log(1 + \exp(\alpha - \|z_i - z_j\|^2))] \big)
\geq \sum_{i \neq j} \big( y_{ij} E_{q}[\alpha - \|z_i - z_j\|^2] - \log(1 + E_{q}[\exp(\alpha - \|z_i - z_j\|^2)]) \big).

EM Algorithm
At the (i+1)-th iteration:
E-step: estimate \tilde{z}_n^{(i+1)} and \tilde{\Sigma}^{(i+1)} by optimizing Q(\tilde{\Theta}; \tilde{\Theta}) = KL[q(Z, \alpha | Y) \| p(Z, \alpha | Y)], where \tilde{\Theta} = (\tilde{\xi}, \tilde{\psi}^2).
M-step: estimate \tilde{\xi} and \tilde{\psi}^2: \tilde{\Theta}^{(i+1)} = \arg\min Q(\tilde{\Theta}; \tilde{\Theta}).

Monks Network: Comparison of Estimation Methods and Distance Metrics
Sampson (1969) recorded the social interactions among a group of N = 18 monks while he was a resident in a New England monastery. The directed links of the network represent liking relationships.

[Figure: latent positions for the Sampson data estimated by MCMC with Euclidean distance (latentnet), the variational approach with Euclidean distance (VBLPCM), and the variational approach with squared Euclidean distance.]
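Returning to the variational objective: the base-R sketch below evaluates KL[q || p] - log p(Y), that is, the KL terms for \alpha and the latent positions minus E_q[log p(Y | Z, \alpha)], with that expectation estimated by Monte Carlo rather than by the Jensen-type bound used above. All names are illustrative assumptions, this is not the lvm4net implementation, and a single covariance \tilde{\Sigma} is shared across nodes for brevity.

```r
## Minimal sketch, base R: evaluate KL[q || p] - log p(Y) for the variational LSM.
## Monte Carlo is used for E_q[log p(Y | Z, alpha)] instead of the Jensen bound.
## Illustrative only; not the lvm4net implementation.
kl_gauss <- function(m1, S1, m0, S0) {   # KL[ N(m1, S1) || N(m0, S0) ]
  D <- length(m1)
  0.5 * (sum(diag(solve(S0, S1))) +
         t(m0 - m1) %*% solve(S0, m0 - m1) - D +
         determinant(S0)$modulus - determinant(S1)$modulus)
}

neg_bound <- function(Y, ztilde, Sigma_t, xi_t, psi2_t,
                      sigma2 = 1, xi0 = 0, psi20 = 10, S = 500) {
  N <- nrow(Y); D <- ncol(ztilde)
  # KL terms for alpha and for each latent position against their priors
  kl <- kl_gauss(xi_t, matrix(psi2_t), xi0, matrix(psi20))
  for (i in 1:N)
    kl <- kl + kl_gauss(ztilde[i, ], Sigma_t, rep(0, D), sigma2 * diag(D))
  # Monte Carlo estimate of E_q[log p(Y | Z, alpha)]
  ell <- 0
  R <- chol(Sigma_t)
  for (s in 1:S) {
    a   <- rnorm(1, xi_t, sqrt(psi2_t))              # alpha drawn from q(alpha)
    Z   <- ztilde + matrix(rnorm(N * D), N, D) %*% R # z_n drawn from q(z_n)
    eta <- a - as.matrix(dist(Z))^2
    ll  <- Y * eta - log1p(exp(eta)); diag(ll) <- 0
    ell <- ell + sum(ll) / S
  }
  as.numeric(kl - ell)
}

## Toy usage on a simulated network
set.seed(3)
N <- 8; D <- 2
Ztrue <- matrix(rnorm(N * D), N, D)
Y <- matrix(rbinom(N * N, 1, plogis(1 - as.matrix(dist(Ztrue))^2)), N); diag(Y) <- 0
neg_bound(Y, ztilde = Ztrue, Sigma_t = 0.1 * diag(D), xi_t = 1, psi2_t = 0.5)
```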

Variational Inference vs MCMC
Pros: closed-form posteriors; far faster than MCMC-based methods; in the absence of posterior dependence the lower bound would match the log-likelihood. As long as the posterior dependence is weak, the variational approximation can be useful for larger networks, as a starting point for MCMC algorithms, and to explore the model space.
Cons: it underestimates variances; it is difficult to assess how tight the lower bound is; it is sensitive to starting values (local minima).

Network and Nodal Attributes
The classical approach to incorporating covariates in the LSM is

p(Y | Z, X, \alpha, \beta) = \prod_{i \neq j} \frac{\exp(\alpha + \beta^T x_{ij} - \|z_i - z_j\|^2)^{y_{ij}}}{1 + \exp(\alpha + \beta^T x_{ij} - \|z_i - z_j\|^2)},

where \beta and x_{ij} are vectors of length M. This uses only link covariate information x_{ij}, so it is not designed to deal with nodal attributes directly. It also assumes that the probability of a link depends on the nodal attributes (social selection), whereas sometimes the nodal attributes depend on the network links (social influence). We present a model where the network and the nodal attribute data mutually depend on each other.

Factor Analysis (FA) for Nodal Attributes
Factor analysis (Spearman, 1904) is a useful technique for visualizing continuous data, reducing the dimensionality from M to D (where D << M) in order to explain the variability expressed by the correlation within the data. FA assumes that there is a continuous latent variable z_n underlying the behaviour of the continuous response variables in each observation x_n:

x_n = \mu + \Lambda z_n + \varepsilon_n,
z_n iid N(0, \sigma^2 I), \varepsilon_n iid N(0, \Psi), \Psi = diag(\psi_1, ..., \psi_M),

so that p(x_n) = N(\mu, (\Lambda\sigma)(\Lambda\sigma)^T + \Psi) and p(z_n | x_n) = N(\hat{z}_n, \hat{\Sigma}). The EM algorithm is used to find the maximum likelihood estimates, and everything can be calculated analytically in closed form.

The Joint Model for Network and Nodal Attributes
The probability of a node being connected to other nodes and the behaviour of its nodal attributes are explained by the same latent variable: a continuous latent variable z_n ~ N(0, \sigma^2 I_D) summarizes the information given by both the network and the nodal attributes. The network Y and the nodal attributes x_n are independent given the latent variable z_n.

[Graphical model: the nodal attributes X depend on (\Lambda, \Psi) and Z through the FA part; the network Y depends on Z and \alpha through the LSM part.]

Fitting the Model
We assume that p(z_n) = N(0, \sigma^2 I). The network data are modelled via the LSM: p(z_n | Y) \approx N(\tilde{z}_n, \tilde{\Sigma}). The nodal attributes are modelled via FA: p(z_n | x_n) = N(\hat{z}_n, \hat{\Sigma}). The joint model is

p(z_n | Y, x_n) \propto \frac{p(z_n | Y) \, p(z_n | x_n)}{p(z_n)} \approx N(\bar{z}_n, \bar{\Sigma}),

where \bar{\Sigma}^{-1} = \tilde{\Sigma}^{-1} + \hat{\Sigma}^{-1} - \sigma^{-2} I_D and \bar{z}_n = \bar{\Sigma} (\tilde{\Sigma}^{-1} \tilde{z}_n + \hat{\Sigma}^{-1} \hat{z}_n).
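To make the FA step above concrete, here is a minimal base-R sketch of the factor-score posterior p(z_n | x_n) = N(\hat{z}_n, \hat{\Sigma}) that enters the combination formula just given. The function name and toy dimensions are illustrative assumptions, not the lvm4net implementation.

```r
## Minimal sketch, base R: factor-score posterior p(z_n | x_n) = N(zhat_n, Sigma_hat)
## in the FA model x_n = mu + Lambda z_n + eps_n, with z_n ~ N(0, sigma2 I_D) and
## eps_n ~ N(0, Psi), Psi diagonal. Illustrative only; not the lvm4net implementation.
fa_posterior <- function(x, mu, Lambda, Psi_diag, sigma2 = 1) {
  D         <- ncol(Lambda)
  prec      <- diag(D) / sigma2 + t(Lambda) %*% (Lambda / Psi_diag)  # posterior precision
  Sigma_hat <- solve(prec)
  zhat      <- Sigma_hat %*% (t(Lambda) %*% ((x - mu) / Psi_diag))
  list(mean = as.numeric(zhat), cov = Sigma_hat)
}

## Toy example with M = 4 attributes and D = 2 latent dimensions
set.seed(2)
Lambda <- matrix(rnorm(4 * 2), 4, 2)
fa_posterior(x = rnorm(4), mu = rep(0, 4), Lambda = Lambda, Psi_diag = rep(0.5, 4))
```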

The Joint Model for Network and Nodal Attributes: EM Algorithm
E-step: estimate \bar{\Sigma}^{(i+1)} and \bar{z}_n^{(i+1)} from

Q(\Theta_{LSM}, \Theta_{FA}; \tilde{\Theta}_{LSM}, \tilde{\Theta}_{FA}) = E_{p(Z | Y, X; \tilde{\Theta}_{LSM}, \tilde{\Theta}_{FA})}[\log p(Y, Z | \Theta_{LSM})] + E_{p(Z | Y, X; \tilde{\Theta}_{LSM}, \tilde{\Theta}_{FA})}[\log p(X, Z | \Theta_{FA})],

therefore

[\bar{\Sigma}^{(i+1)}]^{-1} = [\tilde{\Sigma}^{(i+1)}]^{-1} + [\hat{\Sigma}^{(i+1)}]^{-1} - \sigma^{-2} I_D,
\bar{z}_n^{(i+1)} = \bar{\Sigma}^{(i+1)} \big( [\tilde{\Sigma}^{(i+1)}]^{-1} \tilde{z}_n^{(i+1)} + [\hat{\Sigma}^{(i+1)}]^{-1} \hat{z}_n^{(i+1)} \big).

The fixed parameters are the prior z_n ~ N(0, I) and fixed values of \xi and \psi^2.
M-step: update (\Theta_{LSM}^{(i+1)}, \Theta_{FA}^{(i+1)}) = \arg\max Q(\Theta_{LSM}, \Theta_{FA}; \tilde{\Theta}_{LSM}, \tilde{\Theta}_{FA}).

Yeast Proteins
The network is composed of N nodes representing the interactions between Saccharomyces cerevisiae (yeast) proteins. The nodal attributes consist of expression levels during yeast sporulation from M = 8 experiments on Saccharomyces cerevisiae proteins. Factor analysis is an appropriate tool to visualize the M-dimensional expression data in a low-dimensional latent space.

Results
[Figure: LSM positions (left) and LSJM positions (right).]

Performance
[Figure: ROC curves with the corresponding AUC values, and boxplots of the estimated probabilities of a link for the true negatives and true positives.]

Multiple Network Views
In many applications the behaviour of the nodes is strongly shaped by the complex relations arising from many interactions.
Longitudinal networks: the links represent the same relation at different time points.
Multiplex networks: the links come from different kinds of relations (e.g. genetic and physical interactions).

Joint Modelling of Multiple Network Views
We have K networks on the same nodes, and we propose a model that merges the information given by all these networks. A continuous latent variable z_n ~ N(0, \sigma^2 I_D) identifies the position of node n in a D-dimensional latent space.

[Graphical model: the latent positions Z are shared by all the network views Y_1, ..., Y_K.]
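Referring back to the E-step update above: the joint posterior of z_n simply combines the LSM posterior N(\tilde{z}_n, \tilde{\Sigma}) and the FA posterior N(\hat{z}_n, \hat{\Sigma}) while removing one copy of the shared prior. A minimal base-R sketch of that combination follows (illustrative names, not the lvm4net implementation).

```r
## Minimal sketch, base R: combine the LSM posterior N(ztilde, Sigma_t) and the
## FA posterior N(zhat, Sigma_h) over the shared prior N(0, sigma2 I_D).
## Illustrative only; not the lvm4net implementation.
combine_posteriors <- function(ztilde, Sigma_t, zhat, Sigma_h, sigma2 = 1) {
  D    <- length(ztilde)
  prec <- solve(Sigma_t) + solve(Sigma_h) - diag(D) / sigma2   # joint precision
  Sbar <- solve(prec)                                          # joint covariance
  zbar <- Sbar %*% (solve(Sigma_t, ztilde) + solve(Sigma_h, zhat))
  list(mean = as.numeric(zbar), cov = Sbar)
}

## Example with D = 2
combine_posteriors(ztilde = c(0.5, -0.2), Sigma_t = 0.3 * diag(2),
                   zhat   = c(0.7,  0.1), Sigma_h = 0.5 * diag(2))
```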

Joint Modelling of Multiple Network Views: Model
The probability of a link depends on the distance between two nodes in the latent space:

p(Y_1, ..., Y_K | Z, \alpha) = \prod_{k=1}^{K} \prod_{i \neq j} \frac{\exp(\alpha_k - \|z_i - z_j\|^2)^{y_{ijk}}}{1 + \exp(\alpha_k - \|z_i - z_j\|^2)}.

Variational Approach
For k = 1, ..., K: p(z_n | Y_k) \approx N(\tilde{z}_{nk}, \tilde{\Sigma}_k). Joining the K models:

p(z_n | Y_1, ..., Y_K; \Theta_1, ..., \Theta_K) \propto \frac{\prod_{k=1}^{K} p(z_n | Y_k; \Theta_k)}{p(z_n)^{K-1}} \approx N(\bar{z}_n, \bar{\Sigma}),

where \bar{\Sigma}^{-1} = \sum_{k=1}^{K} \tilde{\Sigma}_k^{-1} - \frac{K-1}{\sigma^2} I_D and \bar{z}_n = \bar{\Sigma} \sum_{k=1}^{K} \tilde{\Sigma}_k^{-1} \tilde{z}_{nk}.

Joint Modelling of Multiple Network Views: EM Algorithm
E-step: estimate the parameters of the joint posterior distribution, \bar{\Sigma}^{(i+1)} and \bar{z}_n^{(i+1)}, from

Q(\Theta_1, ..., \Theta_K; \tilde{\Theta}_1, ..., \tilde{\Theta}_K) = \sum_{k=1}^{K} E_{p(Z | Y_1, ..., Y_K; \tilde{\Theta}_1, ..., \tilde{\Theta}_K)}[\log p(Y_k, Z | \Theta_k)].

We estimate the parameters \tilde{z}_{nk}, \tilde{\Sigma}_k of the posterior distribution p(z_n | Y_k; \Theta_k) for each network k separately, and then merge these estimates to find the joint posterior distribution of the latent positions, N(\bar{z}_n, \bar{\Sigma}).
M-step: update the variational model parameters \tilde{\xi}_1, ..., \tilde{\xi}_K, \tilde{\psi}_1^2, ..., \tilde{\psi}_K^2:
(\Theta_1^{(i+1)}, ..., \Theta_K^{(i+1)}) = \arg\max Q(\Theta_1, ..., \Theta_K; \tilde{\Theta}_1, ..., \tilde{\Theta}_K).

LSJM: Monks Network
We analyse the networks of liking relationships at K = 3 time points, fitting the LSM to each network separately and then the LSJM to all of them jointly.

[Figure: LSM latent positions for the Sampson data at each time point, and the LSJM positions.]

LSJM: Protein-Protein Interactions
K = 2 undirected networks formed by genetic and physical protein-protein interactions between N = 67 Saccharomyces cerevisiae proteins. The complex relational structure of this dataset has led to the development of models aiming to describe the functional relationships between the observations. The data were downloaded from the Biological General Repository for Interaction Datasets (BioGRID) database.

[Figure: latent posterior distributions obtained by fitting the LSM to the genetic and physical interaction networks separately.]
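A minimal base-R sketch of the merging step just described: the K per-view posteriors N(\tilde{z}_{nk}, \tilde{\Sigma}_k) are combined into the joint posterior N(\bar{z}_n, \bar{\Sigma}), dividing out K - 1 copies of the shared prior N(0, \sigma^2 I_D). Names are illustrative assumptions, not the lvm4net implementation.

```r
## Minimal sketch, base R: merge K per-view posteriors N(ztilde_k, Sigma_k)
## into the joint LSJM posterior N(zbar, Sigmabar). Illustrative only.
merge_views <- function(ztilde_list, Sigma_list, sigma2 = 1) {
  K    <- length(ztilde_list)
  D    <- length(ztilde_list[[1]])
  prec <- Reduce(`+`, lapply(Sigma_list, solve)) - (K - 1) * diag(D) / sigma2
  Sbar <- solve(prec)
  zbar <- Sbar %*% Reduce(`+`, Map(solve, Sigma_list, ztilde_list))
  list(mean = as.numeric(zbar), cov = Sbar)
}

## Example with K = 2 views and D = 2
merge_views(ztilde_list = list(c(0.4, -0.1), c(0.6, 0.0)),
            Sigma_list  = list(0.3 * diag(2), 0.4 * diag(2)))
```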

LSJM: Protein-Protein Interactions, ROC and Box Plots
[Figure: ROC curves and boxplots of the estimated probabilities of a link for the true negatives and true positives; AUC = 95% for the genetic interaction network and 95% for the physical interaction network.]

LSJM: Protein-Protein Interactions, LSJM Positions
[Figure: left, the joint posterior p(z_n | Y_1, ..., Y_K; \Theta_1, ..., \Theta_K) \approx N(\bar{z}_n, \bar{\Sigma}) obtained by fitting the LSJM; right, the dots represent the overall positions \bar{z}_n and the arrows connect them to the estimated positions under each single-view model p(z_n | Y_k; \Theta_k) = N(\tilde{z}_{nk}, \tilde{\Sigma}_k).]

LSJM: Missing Links and Missing Nodes
Missing (unobserved) links can easily be handled by the LSJM using the information given by all the network views. To estimate the probability of the presence or absence of an edge we employ the posterior means of the \alpha_k and of the latent positions, which gives

p(y_{ijk} = 1 | \bar{z}_i, \bar{z}_j, \tilde{\xi}_k) = \frac{\exp(\tilde{\xi}_k - \|\bar{z}_i - \bar{z}_j\|^2)}{1 + \exp(\tilde{\xi}_k - \|\bar{z}_i - \bar{z}_j\|^2)}.

If we want to infer whether to assign y_{ijk} = 1 or not, we need to introduce a threshold \tau_k, and let y_{ijk} = 1 if p(y_{ijk} = 1 | \bar{z}_i, \bar{z}_j, \tilde{\xi}_k) > \tau_k.

LSJM: Protein-Protein Interactions, Missing Data
To evaluate the link prediction we applied cross-validation, setting a percentage of the links to be missing at each time point.
Missing links (cross-validation): LSJM: misclassification rate of 9% for the genetic interaction network and 6% for the physical interaction network. LSM: misclassification rate of 8% for the genetic interaction network and 7% for the physical interaction network.
[Figure: ROC curves and boxplots of the estimated probabilities of a link for the true negatives and true positives, with overall AUC values for the genetic and physical interaction networks.]

Missing nodes (cross-validation): LSJM: misclassification rate of 4% for the genetic interaction network and of % for the physical interaction network. LSM: not useful here, since it would locate the missing nodes relying only on the prior information.
[Figure: ROC curves for link prediction fitting the LSJM (left) and a single LSM (right), with the corresponding AUC values for the genetic and physical interaction networks.]

Future work: try to improve the predictions by using a higher dimension for the latent variables.
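A minimal base-R sketch of the prediction rule above, using plug-in posterior means; the function name, inputs, and default threshold are illustrative assumptions rather than the lvm4net interface.

```r
## Minimal sketch, base R: predicted probability of a link in view k from the
## posterior means, then thresholding at tau_k. Illustrative only.
predict_link <- function(zbar_i, zbar_j, xi_k, tau_k = 0.5) {
  p <- plogis(xi_k - sum((zbar_i - zbar_j)^2))  # p(y_ijk = 1 | zbar_i, zbar_j, xi_k)
  list(prob = p, link = as.integer(p > tau_k))
}

predict_link(zbar_i = c(0.2, 0.1), zbar_j = c(-0.3, 0.5), xi_k = 1.2)
```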

Conclusions
The joint models are particularly useful to locate unconnected nodes or subgraphs in the latent space and to estimate missing links, and they have a wide range of applications. Variational Bayes allows us to deal with networks of thousands of nodes.
Possible extensions: joint models for directed networks using the inner product instead of the Euclidean distance; joint models for categorical nodal attributes (LTA instead of FA); joint models with clusters (LPCM, MFA, MLTA); beyond binary networks: rank and count data.

References
Hoff, P. D., Raftery, A. E., and Handcock, M. S. (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460), 1090-1098.
Salter-Townshend, M., White, A., Gollini, I., and Murphy, T. B. (2012). Review of statistical network analysis: models, algorithms, and software. Statistical Analysis and Data Mining, 5(4).
Gollini, I., and Murphy, T. B. Joint modelling of multiple network views. arXiv:1301.3759.
