
The qp Package

December 21, 2006

Type: Package
Title: q-order partial correlation graph search algorithm
Version: 0.2-1
Date: 2006-12-18
Author: Robert Castelo <robert.castelo@upf.edu>, Alberto Roverato <alberto.roverato@unibo.it>
Maintainer: Robert Castelo <robert.castelo@upf.edu>
Depends: R (>= 2.2.1)
Description: The q-order partial correlation graph search algorithm, q-partial, or qp, algorithm for short, is a robust procedure for structure learning of undirected Gaussian graphical Markov models from small n, large p data, that is, multivariate normal data coming from a number of random variables p larger than the number of multidimensional data points n, as in the case of, e.g., microarray data.
License: GPL version 2 or newer

R topics documented:

    jmlr06data
    qp
    qp.analyse
    qp.ci.test
    qp.clique
    qp.edge.prob
    qp.get.cliques
    qp.graph
    qp.hist
    qp.matrix.image
    qp.search

jmlr06data          Synthetic data from the article by Castelo and Roverato (2006)

Format

Synthetic data generated from two graphs with 150 vertices, G1 and G2. In G1 the boundary of every vertex has size at most 5, while in G2 the boundary of every vertex has size at most 20.

    IC.bd5                inverse correlation matrix encoding the independence structure of G1
    IC.bd20               inverse correlation matrix encoding the independence structure of G2
    S.bd5.N20             sample covariance matrix from a sample of size 20 drawn from a normal distribution with mean 0 and inverse correlation matrix IC.bd5
    S.bd5.N50             sample covariance matrix from a sample of size 50 drawn from a normal distribution with mean 0 and inverse correlation matrix IC.bd5
    S.bd5.N150            sample covariance matrix from a sample of size 150 drawn from a normal distribution with mean 0 and inverse correlation matrix IC.bd5
    S.bd20.N20            sample covariance matrix from a sample of size 20 drawn from a normal distribution with mean 0 and inverse correlation matrix IC.bd20
    S.bd20.N50            sample covariance matrix from a sample of size 50 drawn from a normal distribution with mean 0 and inverse correlation matrix IC.bd20
    S.bd20.N150           sample covariance matrix from a sample of size 150 drawn from a normal distribution with mean 0 and inverse correlation matrix IC.bd20
    qp.out.bd5.n20.q10    output from qp.search applied to S.bd5.N20 with q=10 and T=500
    qp.out.bd20.n20.q10   output from qp.search applied to S.bd20.N20 with q=10 and T=500


qp          The package qp: summary information

Description

This package provides functions implementing the q-order partial-correlation graph search algorithm, q-partial, or qp, algorithm for short. The qp algorithm is a robust procedure for structure learning of undirected Gaussian graphical Markov models (UGGMMs) from "small n, large p" data, that is, multivariate normal data coming from a number of random variables p larger than the number of multidimensional data points n, as in the case of, e.g., microarray data.

Data

    jmlr06data        synthetic data used in the referenced article

Functions

    qp.search         calculates the estimates of the non-rejection rates for every pair of variables
    qp.edge.prob      calculates the estimate of the non-rejection rate for a particular pair of variables; this function is also called by qp.search
    qp.ci.test        performs a test for conditional independence
    qp.analyse        provides some exploratory analyses on the output of qp.search
    qp.clique         calculates the maximum clique size as a function of the minimum threshold on the non-rejection rate for removing an edge
    qp.hist           shows a histogram of the estimated non-rejection rates obtained through qp.search
    qp.graph          returns the qp-graph, in the form of an incidence matrix, resulting from thresholding the non-rejection rates in the output of qp.search
    qp.matrix.image   makes an image plot of the absolute value of an inverse correlation matrix
    qp.get.cliques    finds the set of cliques of an undirected graph

The package provides an implementation of the procedures described by Castelo and Roverato (2006) and is a contribution to the gR-project described by Lauritzen (2002).

Authors

Robert Castelo, Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, Barcelona, Spain.
Alberto Roverato, Dipartimento di Scienze Statistiche, Università di Bologna, Italy.

References

Lauritzen, S. L. (2002). gRaphical Models in R. R News, 3(2)39.
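To make the "small n, large p" setting concrete, the following is a minimal base-R sketch (it does not use the qp package, and the values of n, p and q are arbitrary assumptions). It illustrates that with fewer observations than variables the sample covariance matrix is rank deficient, so full-order partial correlations cannot be obtained by inverting it, whereas a small marginal block of it can still be inverted.

```r
# Illustration only: simulated data with more variables (p) than observations (n).
set.seed(1)
n <- 20   # number of observations (arbitrary choice)
p <- 50   # number of variables (arbitrary choice)
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
S <- cov(X)

# The rank of S is at most n - 1, so S cannot be inverted and the
# full-order partial correlations are not available.
rank.S <- qr(S)$rank

# A marginal block on q + 2 variables, with q + 2 <= n - 1, is still of
# full rank, which is why low-order partial correlations remain computable.
q <- 5
idx <- 1:(q + 2)
rank.block <- qr(S[idx, idx])$rank
```

Here rank.S stays below n while the (q+2) x (q+2) block has full rank.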

qp.analyse          Performs some exploratory analyses on the q-partial graph

Description

Using the output of qp.search this function provides some exploratory analyses on the resulting q-partial graph.

Usage

    qp.analyse(qp.output, threshold, largest.clique=TRUE, plot.image=TRUE,
               exact.calculation=FALSE, approximation.iterations=100)

Arguments

    qp.output                  output of qp.search
    threshold                  threshold on the minimum non-rejection rate required for edge removal
    largest.clique             when this flag is set to TRUE the size of the largest clique is calculated
    plot.image                 when this flag is set to TRUE the incidence matrix resulting from thresholding the non-rejection rates is plotted
    exact.calculation          when this flag is set to TRUE the exact maximum clique size is calculated; when set to FALSE a lower bound is calculated instead. It applies only when largest.clique=TRUE
    approximation.iterations   number of iterations performed to calculate the lower bound on the clique number of each graph. It applies only when largest.clique=TRUE and exact.calculation=FALSE

Details

Returns an object of the class matrix showing the number of selected edges, the number of edges of the complete graph and the percentage of selected edges. When largest.clique=TRUE it also gives the size of the largest clique, and when plot.image=TRUE it plots the incidence matrix resulting from thresholding the non-rejection rates.

Beware that setting largest.clique=TRUE and exact.calculation=TRUE with a threshold between 0.95 and 1.0 (which may result in very dense graphs) can lead to a very long computation time. Calculating the size of the largest clique is an NP-complete problem, so the running time may grow exponentially as a function of the graph density (cf. Pardalos and Xue, 1994).

The lower bound on the maximum clique size is calculated by ranking the vertices by their connectivity degree, putting the first vertex in a set and going through the rest of the ranking, adding to the set those vertices that form a clique with the vertices currently in the set. Once the entire ranking has been examined a large clique should have been built, and hopefully the largest one. This process is repeated a number of times (approximation.iterations), and in each repetition the ranking is altered with an increasing level of randomness, cycling from altering 1 vertex up to altering p vertices and then starting again. Larger values of approximation.iterations should provide tighter lower bounds and eventually the exact maximum clique size (the clique number).

References

Pardalos, P.M. and Xue, J. (1994). The maximum clique problem, J. Global Optim., 4:301-328.

See Also

qp.search, qp.clique

Examples

    qp.analyse(qp.out.bd5.n20.q10, threshold=0.9, largest.clique=TRUE)


qp.ci.test          Conditional independence test

Description

Performs a test for conditional independence between the variables indexed by i and j given the conditioning set Q.

Usage

    qp.ci.test(S, N, i=1, j=2, Q=c(), binary=TRUE)

Arguments

    S        sample variance-covariance matrix
    N        sample size
    i        index of one variable
    j        index of another variable
    Q        conditioning set
    binary   flag to switch to the compiled C code
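As an illustration of what such a test computes, here is a minimal base-R sketch. It is not the package's implementation: the function name ci.test.sketch is hypothetical, and it assumes the standard t-test for a zero partial correlation with N - q - 2 degrees of freedom, where q is the size of Q. The manual does not spell out the statistic used by qp.ci.test, so treat this as one plausible reading.

```r
# Hypothetical helper, not the package's qp.ci.test: t-test for a zero
# partial correlation between variables i and j given the set Q.
ci.test.sketch <- function(S, N, i, j, Q = c()) {
  idx <- c(i, j, Q)
  q <- length(Q)
  # Partial correlation of i and j given Q, read off the inverse of the
  # marginal covariance block on (i, j, Q)
  K <- solve(S[idx, idx])
  r <- -K[1, 2] / sqrt(K[1, 1] * K[2, 2])
  df <- N - q - 2
  t.value <- r * sqrt(df / (1 - r^2))
  p.value <- 2 * pt(-abs(t.value), df)
  list(t.value = t.value, p.value = p.value)
}

# Simulated data: variables 1 and 2 are strongly dependent, 5 and 6 are not
set.seed(1)
N <- 50
X <- matrix(rnorm(N * 6), nrow = N)
X[, 2] <- X[, 1] + 0.1 * rnorm(N)
S <- cov(X)
res.dep <- ci.test.sketch(S, N, i = 1, j = 2, Q = c(3, 4))
res.ind <- ci.test.sketch(S, N, i = 5, j = 6, Q = c(3, 4))
```

The returned list mirrors the t.value and p.value components documented for qp.ci.test.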

Details

By default binary=TRUE and the compiled, faster C code corresponding to this function is executed. If binary=FALSE is set, the R code is executed instead.

Value

    t.value   the t-statistic value
    p.value   the p-value on rejecting the null hypothesis of conditional independence

See Also

qp.edge.prob

Examples

    S <- S.bd5.N20
    N <- 20
    qp.ci.test(S, N, i=3, j=4, Q=c(5,6,7))


qp.clique          Relationship between non-rejection rate and maximum clique size

Description

Using the output of qp.search this function calculates the maximum clique size as a function of the minimum threshold on the non-rejection rate for removing an edge.

Usage

    qp.clique(qp.output, N, threshold.lim=c(0,1), breaks=5, plot.image=TRUE,
              exact.calculation=FALSE, approximation.iterations=100)

Arguments

    qp.output                  output of qp.search
    N                          sample size
    threshold.lim              range of the non-rejection rate threshold on which to calculate the function
    breaks                     one of: a vector giving the breakpoints along the range defined by threshold.lim; a single number giving the number of equidistant breakpoints that divide the range defined by threshold.lim
    plot.image                 when this flag is set to TRUE, the qp.clique plot is produced
    exact.calculation          when this flag is set to TRUE the exact maximum clique size is calculated; when set to FALSE a lower bound is calculated instead
    approximation.iterations   number of iterations performed to calculate the lower bound on the clique number of each graph. It applies only when exact.calculation=FALSE

Details

The qp.clique plot provides information on the graphs potentially selected by specifying different values of the threshold. Every circle in the plot corresponds to a graph and has three values associated with it: the threshold value used to construct the graph (horizontal axis); the number of vertices of the largest clique of the graph (vertical axis); and the percentage of present edges in the graph (number inside the plot, beside the circle). Furthermore, adjacent circles are joined by a line, and the dotted horizontal line corresponds to the sample size N.

Beware that setting exact.calculation=TRUE and giving breakpoints between 0.95 and 1.0 may result in very dense graphs, which can lead to a very long computation time. Calculating the size of the largest clique is an NP-complete problem, so the running time may grow exponentially as a function of the graph density (cf. Pardalos and Xue, 1994).

The lower bound on the maximum clique size is calculated by ranking the vertices by their connectivity degree, putting the first vertex in a set and going through the rest of the ranking, adding to the set those vertices that form a clique with the vertices currently in the set. Once the entire ranking has been examined a large clique should have been built, and hopefully the largest one. This process is repeated a number of times (approximation.iterations), and in each repetition the ranking is altered with an increasing level of randomness, cycling from altering 1 vertex up to altering p vertices and then starting again. Larger values of approximation.iterations should provide tighter lower bounds and eventually the exact maximum clique size (the clique number).

Value

    threshold   threshold on the non-rejection rate that provides the maximum clique size that is strictly smaller than the sample size N
    size        maximum clique size strictly smaller than the sample size N
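The greedy lower bound described in the Details can be sketched in base R as follows. This is an illustrative reading, not the package's code: the function name clique.lower.bound.sketch is hypothetical, and the way the ranking is perturbed (shuffling k randomly chosen positions, with k cycling through 1 to p) is an assumption about what "altering 1 to p vertices" means.

```r
# Hypothetical sketch of the greedy lower bound on the clique number.
# I is a symmetric 0/1 incidence matrix with a zero diagonal.
clique.lower.bound.sketch <- function(I, iterations = 100) {
  p <- nrow(I)
  ranking <- order(rowSums(I), decreasing = TRUE)  # rank vertices by degree
  best <- 0
  for (it in seq_len(iterations)) {
    ord <- ranking
    # Number of ranking positions to shuffle; cycles through 1..p.
    # k = 1 leaves the ranking unchanged (assumed perturbation scheme).
    k <- ((it - 1) %% p) + 1
    if (k > 1) {
      pos <- sample(p, k)
      ord[pos] <- ord[sample(pos)]
    }
    # Greedy pass: add each vertex that is adjacent to all current members
    clique <- ord[1]
    for (v in ord[-1]) {
      if (all(I[v, clique] != 0)) clique <- c(clique, v)
    }
    best <- max(best, length(clique))
  }
  best
}

# Toy graph: a 4-clique on vertices 1..4 plus the edges 4-5 and 5-6
set.seed(1)
I <- matrix(0, 6, 6)
I[1:4, 1:4] <- 1
I[4, 5] <- I[5, 4] <- 1
I[5, 6] <- I[6, 5] <- 1
diag(I) <- 0
lb <- clique.lower.bound.sketch(I, iterations = 20)
```

Because every pass builds a genuine clique, the result can never exceed the true clique number; more iterations can only keep or raise the bound.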

References

Pardalos, P.M. and Xue, J. (1994). The maximum clique problem, J. Global Optim., 4:301-328.

See Also

qp.search

Examples

    qp.clique(qp.out.bd5.n20.q10, 20)


qp.edge.prob          Estimate of the non-rejection rate

Description

Calculates the estimate of the non-rejection rate for a pair of variables, that is, the proportion of conditional independence tests that do not reject the null hypothesis of zero partial correlation given the q-order conditionals.

Usage

    qp.edge.prob(S, N, i=1, j=2, q=0, T=500, significance=0.05, binary=TRUE)

Arguments

    S              sample variance-covariance matrix
    N              sample size
    i              index of one variable
    j              index of another variable
    q              partial-correlation order
    T              number of tests per adjacency
    significance   significance level of each test
    binary         flag to switch to the compiled C code

Details

By default binary=TRUE and the compiled, faster C code corresponding to this function is executed. If binary=FALSE is set, the R code is executed instead.
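The idea of a non-rejection rate can be illustrated with a short base-R sketch. It is not the package's qp.edge.prob: the name edge.prob.sketch is hypothetical, conditioning sets are assumed to be drawn uniformly at random from the remaining variables, and the test is assumed to be the standard t-test for a zero partial correlation. The number of tests is called nTests here rather than T, only to avoid shadowing R's T shorthand.

```r
# Hypothetical sketch: estimate the non-rejection rate of the pair (i, j)
# from nTests tests, each with a random conditioning set of size q.
edge.prob.sketch <- function(S, N, i, j, q, nTests = 500, significance = 0.05) {
  others <- setdiff(seq_len(nrow(S)), c(i, j))
  df <- N - q - 2
  accepted <- 0
  for (k in seq_len(nTests)) {
    # Random conditioning set Q of size q (assumed sampling scheme)
    Q <- others[sample.int(length(others), q)]
    idx <- c(i, j, Q)
    K <- solve(S[idx, idx])
    r <- -K[1, 2] / sqrt(K[1, 1] * K[2, 2])
    t.value <- r * sqrt(df / (1 - r^2))
    p.value <- 2 * pt(-abs(t.value), df)
    # Count the tests that do not reject conditional independence
    if (p.value > significance) accepted <- accepted + 1
  }
  accepted / nTests
}

# Simulated data: variables 1 and 2 are strongly dependent, 5 and 6 are not
set.seed(1)
N <- 30
p <- 10
X <- matrix(rnorm(N * p), nrow = N)
X[, 2] <- X[, 1] + 0.1 * rnorm(N)
S <- cov(X)
rate.dep <- edge.prob.sketch(S, N, i = 1, j = 2, q = 3, nTests = 100)
rate.ind <- edge.prob.sketch(S, N, i = 5, j = 6, q = 3, nTests = 100)
```

A pair that is dependent given every conditioning set, such as (1, 2) here, gets a rate close to zero, which is what marks it as a likely edge.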

See Also

qp.search, qp.ci.test

Examples

    S <- S.bd5.N20
    N <- 20
    q <- 6
    T <- 100
    qp.edge.prob(S, N, i=3, j=4, q, T)


qp.get.cliques          Cliques of an undirected graph

Description

Finds the set of cliques, i.e. maximal complete subsets of vertices, of an undirected graph given as an incidence matrix.

Usage

    qp.get.cliques(I, binary=TRUE)

Arguments

    I        incidence matrix
    binary   flag to switch to the compiled C code

Details

It uses the algorithm described in Bron and Kerbosch (1973) and returns a list where each member is a vector of vertices forming a clique in the given graph. Beware that the problem of finding the set of cliques is NP-complete and the computation time of this algorithm grows exponentially in the graph density (number of actual edges over the total number of adjacencies).
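For readers unfamiliar with the Bron and Kerbosch algorithm, the following base-R sketch shows its basic recursion without pivoting. It is not the package's compiled implementation, and the name get.cliques.sketch is hypothetical. The list-of-vectors return value follows the description above.

```r
# Hypothetical sketch of the basic Bron-Kerbosch recursion (no pivoting).
# I is a symmetric 0/1 incidence matrix; returns a list of maximal cliques.
get.cliques.sketch <- function(I) {
  p <- nrow(I)
  nbrs <- lapply(seq_len(p), function(v) setdiff(which(I[v, ] != 0), v))
  cliques <- list()
  # R: current clique, P: candidate vertices, X: already processed vertices
  expand <- function(R, P, X) {
    if (length(P) == 0 && length(X) == 0) {
      # Nothing can extend R, so R is a maximal clique
      cliques[[length(cliques) + 1]] <<- sort(R)
      return(invisible(NULL))
    }
    for (v in P) {
      expand(c(R, v), intersect(P, nbrs[[v]]), intersect(X, nbrs[[v]]))
      P <- setdiff(P, v)
      X <- c(X, v)
    }
    invisible(NULL)
  }
  expand(integer(0), seq_len(p), integer(0))
  cliques
}

# Toy graph: a 4-clique on vertices 1..4 plus the edges 4-5 and 5-6
I <- matrix(0, 6, 6)
I[1:4, 1:4] <- 1
I[4, 5] <- I[5, 4] <- 1
I[5, 6] <- I[6, 5] <- 1
diag(I) <- 0
cliquelist <- get.cliques.sketch(I)
```

On this toy graph the maximal cliques are {1,2,3,4}, {4,5} and {5,6}.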

References

Bron, C. and Kerbosch, J. (1973). Finding all cliques of an undirected graph, Commun. ACM, 16:575-577.

See Also

qp.graph, qp.clique

Examples

    I <- qp.graph(qp.out.bd5.n20.q10, threshold=0.9)
    cliquelist <- qp.get.cliques(I)
    sprintf("the graph has %d cliques\n", length(cliquelist))


qp.graph          Incidence matrix of the qp-graph

Description

Using the output of qp.search this function returns the qp-graph, in the form of an incidence matrix, resulting from thresholding the non-rejection rates in the output of qp.search.

Usage

    qp.graph(qp.output, threshold)

Arguments

    qp.output   output of qp.search
    threshold   threshold on the non-rejection rate
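The thresholding step can be pictured with the base-R sketch below. It is not the package's qp.graph: the name qp.graph.sketch is hypothetical, the output of qp.search is assumed to be a list with a matrix A of acceptance counts and the number of tests T, and an edge is assumed to be kept when its non-rejection rate A/T is strictly below the threshold. The toy counts are made up for illustration.

```r
# Hypothetical sketch: build a 0/1 incidence matrix by thresholding
# non-rejection rates. Assumes qp.output is a list with components A
# (acceptance counts) and T (number of tests per adjacency).
qp.graph.sketch <- function(qp.output, threshold) {
  rates <- qp.output$A / qp.output$T
  # Assumed convention: pairs at or above the threshold lose their edge
  I <- (rates < threshold) * 1
  diag(I) <- 0
  I
}

# Made-up acceptance counts for three variables, out of 500 tests each
A <- matrix(c(  0,  10, 480,
               10,   0, 500,
              480, 500,   0), nrow = 3, byrow = TRUE)
toy.output <- list(A = A, T = 500)
I <- qp.graph.sketch(toy.output, threshold = 0.9)
n.edges <- sum(I) / 2
```

Only the pair (1, 2), with a rate of 0.02, keeps its edge in this toy example.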

See Also

qp.search

Examples

    I <- qp.graph(qp.out.bd5.n20.q10, threshold=0.9)
    sprintf("the graph has %.0f edges\n", sum(I)/2)


qp.hist          Histogram of the non-rejection rates

Description

Using the output of qp.search this function plots the histogram of the estimated non-rejection rates. When the inverse correlation matrix from the generative graph is given, it provides additional informative plots.

Usage

    qp.hist(qp.output, IC=NULL, prob=FALSE)

Arguments

    qp.output   output of qp.search
    IC          inverse correlation matrix from the generative graph
    prob        when this flag is set to TRUE the histograms show densities; otherwise they show absolute frequencies

See Also

qp.search

Examples

    # if we are working with synthetic data and have the IC matrix
    qp.hist(qp.out.bd5.n20.q10, IC.bd5, prob=TRUE)

    # otherwise, we just look at the non-rejection rate distribution
    qp.hist(qp.out.bd5.n20.q10, NULL, prob=TRUE)


qp.matrix.image          Image of an inverse correlation matrix

Description

Makes an image plot of the absolute value of an inverse correlation matrix and reports the number of edges of the corresponding independence graph, the total number of adjacencies of the graph and the percentage of edges with respect to this total number of adjacencies.

Usage

    qp.matrix.image(M, col=NULL, plot=TRUE)

Arguments

    M      the matrix from which to make the image plot
    col    when set to NULL a gray scale is used in the plot
    plot   when this flag is set to TRUE the image is plotted

Details

Returns an object of the class matrix containing the number of edges of the corresponding independence graph, the total number of adjacencies of the graph and the percentage of edges with respect to this total number of adjacencies. When plot=TRUE it plots the partial correlation coefficients as a matrix.
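The summary figures reported by this function can be reproduced in spirit with a few lines of base R. The sketch below is not the package's qp.matrix.image: the name matrix.summary.sketch is hypothetical, it returns a named vector rather than a matrix, and it assumes that an edge corresponds to a nonzero off-diagonal entry of the inverse correlation matrix.

```r
# Hypothetical sketch: count edges encoded by an inverse correlation matrix.
matrix.summary.sketch <- function(M, plot = FALSE) {
  p <- nrow(M)
  absM <- abs(M)
  # An edge is assumed to be a nonzero off-diagonal entry
  n.edges <- sum(absM[upper.tri(absM)] > 0)
  n.adj <- p * (p - 1) / 2
  if (plot) {
    # Gray scale image: larger absolute values are drawn darker
    image(absM, col = gray(seq(1, 0, length.out = 64)))
  }
  c(edges = n.edges, adjacencies = n.adj, percentage = 100 * n.edges / n.adj)
}

# Toy inverse correlation matrix of a chain graph 1 - 2 - 3 - 4
M <- diag(4)
M[cbind(1:3, 2:4)] <- -0.4
M[cbind(2:4, 1:3)] <- -0.4
res <- matrix.summary.sketch(M)
```

For the chain on four vertices this gives 3 edges out of 6 possible adjacencies, that is, 50 percent.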

Examples

    qp.matrix.image(IC.bd5)


qp.search          Matrix of non-rejection rates

Description

Calculates the estimates of the non-rejection rates for every pair of variables.

Usage

    qp.search(S, N, q=0, T=500, significance=0.05, binary=TRUE)

Arguments

    S              sample variance-covariance matrix
    N              sample size
    q              partial-correlation order
    T              number of tests per adjacency
    significance   significance level of each test
    binary         flag to switch to the compiled C code

Details

By default binary=TRUE and the compiled, faster C code corresponding to this function is executed. If binary=FALSE is set, the R code is executed instead.

Value

    A   matrix with the acceptance test counts
    T   number of tests per adjacency (copied from the input parameter)

See Also

qp.edge.prob, qp.analyse, qp.hist

Examples

    S <- S.bd5.N20
    N <- 20
    q <- 6
    T <- 100
    qp.out <- qp.search(S, N, q, T)

Index

Topic datasets
    jmlr06data

Topic graphs
    qp, qp.analyse, qp.ci.test, qp.clique, qp.edge.prob, qp.get.cliques, qp.graph, qp.hist, qp.matrix.image, qp.search

Topic models
    qp, qp.analyse, qp.ci.test, qp.clique, qp.edge.prob, qp.get.cliques, qp.graph, qp.hist, qp.matrix.image, qp.search

Topic multivariate
    qp, qp.analyse, qp.ci.test, qp.clique, qp.edge.prob, qp.get.cliques, qp.graph, qp.hist, qp.matrix.image, qp.search

Objects documented under jmlr06data
    IC.bd5, IC.bd20, S.bd5.N20, S.bd5.N50, S.bd5.N150, S.bd20.N20, S.bd20.N50, S.bd20.N150, qp.out.bd5.n20.q10, qp.out.bd20.n20.q10
