More about liquid association

Dale Ramsey
5 years ago

1 More about liquid association

2 Liquid Association (LA)

LA is a generalized notion of association for describing a certain kind of ternary relationship between variables in a system (Li 2002, PNAS).

[Figure: scatter plot of the profiles of genes X and Y. Both axes run from low (-) to high (+). Legend: state 1, state 2, transit state, Linear (state 1), Linear (state 2).]

Green points represent four conditions for cellular state 1. Red points represent four conditions for cellular state 2. Blue points represent the transit state between cellular states 1 and 2. (X, Y) forms an LA. Profiles of genes X and Y are displayed in the above scatter plot.

Important! The correlation between X and Y is 0.

3 Statistical theory for LA

X, Y, Z: random variables with mean 0 and variance 1.

$\mathrm{Corr}(X,Y) = E(XY) = E(E(XY \mid Z)) = E\,g(Z)$

$g(z) = E(XY \mid Z = z)$: an ideal summary of the association pattern between X and Y when Z = z.
$g'(z)$: the derivative of $g(z)$.

Definition. The LA of X and Y with respect to Z is $LA(X,Y \mid Z) = E\,g'(Z)$.

4 Statistical theory: LA

Theorem. If Z is standard normal, then $LA(X,Y \mid Z) = E(XYZ)$.

Proof. By Stein's lemma: $E\,g'(Z) = E\,g(Z)Z = E(E(XY \mid Z)\,Z) = E(XYZ)$.

Additional mathematical properties:
- bounded by third moment
- equal to 0 if jointly normal
- transformation

5 Stein's lemma

Computing $E(g'(Z))$ directly is not easy. With help from mathematical statistics, $LA(X,Y \mid Z)$ simplifies to $E(XYZ)$ when Z follows a standard normal distribution:

$LA(X,Y \mid Z) = E(g'(Z)) = E(Z\,g(Z)) = E(Z\,E(XY \mid Z)) = E(E(XYZ \mid Z)) = E(XYZ)$

The second equality, $E(g'(Z)) = E(Z\,g(Z))$, is Stein's lemma.
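The chain of equalities above can be checked numerically. The following is a minimal sketch, not taken from the slides: it assumes a toy model in which Z is standard normal and, given Z = z, the pair (X, Y) is bivariate normal with unit variances and correlation sin(z), so that g(z) = sin(z) and g'(z) = cos(z). The function name `simulate_la` and the choice of sin(z) are illustrative assumptions. For this model both E g'(Z) and E(XYZ) equal exp(-1/2), while E(XY) = 0.

```python
import math
import random


def simulate_la(n=200_000, seed=0):
    """Monte Carlo estimates of E(XYZ), E g'(Z) and E(XY) for a toy model.

    Assumed model (hypothetical, for illustration only):
      Z ~ N(0, 1); given Z = z, corr(X, Y) = sin(z), so g(z) = sin(z).
    """
    rng = random.Random(seed)
    sum_xyz = 0.0     # accumulates x*y*z, estimating E(XYZ)
    sum_gprime = 0.0  # accumulates cos(z), estimating E g'(Z)
    sum_xy = 0.0      # accumulates x*y, estimating Corr(X, Y)
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        rho = math.sin(z)
        x = rng.gauss(0.0, 1.0)
        # y has unit variance and correlation rho with x, conditionally on z
        y = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        sum_xyz += x * y * z
        sum_gprime += math.cos(z)
        sum_xy += x * y
    return sum_xyz / n, sum_gprime / n, sum_xy / n


if __name__ == "__main__":
    est_xyz, est_gprime, est_xy = simulate_la()
    print("E(XYZ)   ~", round(est_xyz, 3))
    print("E g'(Z)  ~", round(est_gprime, 3))
    print("E(XY)    ~", round(est_xy, 3))
    print("exp(-1/2) =", round(math.exp(-0.5), 3))
```

The two LA estimates should agree with each other up to simulation error, while the plain correlation stays near zero, which is the situation the scatter plot on slide 2 describes.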

6 Lemma 1: $E\,h'(X) = h(1) - h(0)$, where X ~ Uniform[0,1] and h is differentiable.

This is the fundamental theorem of calculus. Sir Isaac Newton and Gottfried Leibniz [from Wikipedia]

7 Lemma 2: $E\,h'(X) = E\,X h(X)$, where X ~ Normal(0,1).

This is Stein's lemma (Charles Stein). The proof is by integration by parts:
- Start from the right side.
- Write down the density of X.
- Integrate by parts.
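The three proof steps can be written out as follows. This is a sketch under the additional assumption (not stated on the slide) that $h(x)\phi(x) \to 0$ as $x \to \pm\infty$, where $\phi$ denotes the standard normal density.

```latex
\begin{align*}
E[X h(X)]
  &= \int_{-\infty}^{\infty} x\,h(x)\,\phi(x)\,dx,
  \qquad \phi(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2},
  \quad \phi'(x) = -x\,\phi(x) \\
  &= -\int_{-\infty}^{\infty} h(x)\,\phi'(x)\,dx \\
  &= -\Big[\,h(x)\,\phi(x)\,\Big]_{-\infty}^{\infty}
     + \int_{-\infty}^{\infty} h'(x)\,\phi(x)\,dx \\
  &= E[h'(X)].
\end{align*}
```

The boundary term vanishes by the stated assumption, which leaves exactly the left side of Lemma 2.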

8 Lemma 3: $E\,X h(X) = \lambda\,E\,h(X+1)$, where X ~ Poisson(λ).

Chen-Stein method for Poisson approximation. Louis Chen, National University of Singapore, Director of IMS.
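The Poisson identity in Lemma 3 can be verified numerically by summing the series directly. The sketch below is illustrative only: the helper name `poisson_expectation`, the truncation point `kmax`, and the test function h are assumptions, not part of the slides.

```python
import math


def poisson_expectation(f, lam, kmax=100):
    """Approximate E f(X) for X ~ Poisson(lam) by truncating the series at kmax.

    The probabilities are built up recursively, p(k+1) = p(k) * lam / (k+1),
    which avoids computing large factorials.
    """
    total = 0.0
    p = math.exp(-lam)  # P(X = 0)
    for k in range(kmax + 1):
        total += f(k) * p
        p *= lam / (k + 1)
    return total


def h(k):
    # Arbitrary test function, chosen only for illustration
    return math.sin(k) + k * k


if __name__ == "__main__":
    lam = 3.5
    left = poisson_expectation(lambda k: k * h(k), lam)       # E X h(X)
    right = lam * poisson_expectation(lambda k: h(k + 1), lam)  # lam * E h(X+1)
    print(left, right)
```

For moderate λ the truncation error at `kmax=100` is negligible, so the two printed numbers should match to many decimal places.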

9 Inadmissibility of the normal mean when dimension ≥ 3

$X_1 \sim N(\mu_1, \sigma^2)$, $X_2 \sim N(\mu_2, \sigma^2)$, $X_3 \sim N(\mu_3, \sigma^2)$

Squared error loss for estimating the mean parameters; variance known.

$\text{Risk} = E\{(X_1 - \mu_1)^2 + (X_2 - \mu_2)^2 + (X_3 - \mu_3)^2\} = 3\sigma^2$

A better estimate can be constructed by shrinkage toward the origin. Write $y = (x_1, x_2, x_3)$ and $\theta = (\mu_1, \mu_2, \mu_3)$. By Stein's lemma, an unbiased estimate of the risk of the James-Stein estimate can be constructed; the resulting risk is below $3\sigma^2$.
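A small simulation makes the risk comparison concrete. This is a sketch under stated assumptions rather than the presenter's own computation: it uses the standard James-Stein form $(1 - (p-2)\sigma^2/\lVert x \rVert^2)\,x$ with shrinkage toward the origin, and the usual unbiased risk estimate $p\sigma^2 - (p-2)^2\sigma^4/\lVert x \rVert^2$. The function name `james_stein_risks` and the example value of θ are hypothetical.

```python
import random


def james_stein_risks(theta, sigma=1.0, n_sim=50_000, seed=0):
    """Monte Carlo risks of the raw observation and the James-Stein estimate.

    Returns (risk of x, risk of James-Stein, average unbiased risk estimate).
    """
    rng = random.Random(seed)
    p = len(theta)
    mle_loss = 0.0
    js_loss = 0.0
    sure = 0.0
    for _ in range(n_sim):
        x = [rng.gauss(t, sigma) for t in theta]
        norm2 = sum(v * v for v in x)
        # Shrink the whole vector toward the origin
        shrink = 1.0 - (p - 2) * sigma ** 2 / norm2
        js = [shrink * v for v in x]
        mle_loss += sum((v - t) ** 2 for v, t in zip(x, theta))
        js_loss += sum((v - t) ** 2 for v, t in zip(js, theta))
        # Unbiased risk estimate; it depends on x only, not on theta
        sure += p * sigma ** 2 - (p - 2) ** 2 * sigma ** 4 / norm2
    return mle_loss / n_sim, js_loss / n_sim, sure / n_sim


if __name__ == "__main__":
    mle, js, sure = james_stein_risks([1.0, 0.5, -0.5])
    print("risk of x            ~", round(mle, 3))
    print("risk of James-Stein  ~", round(js, 3))
    print("mean unbiased est.   ~", round(sure, 3))
```

With σ = 1 and three coordinates, the first number should be close to 3 and the second noticeably smaller. The unbiased estimate is heavy-tailed in dimension 3, so its average converges slowly.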

10 Normality?

Convert each gene expression profile by taking the normal score transformation.

LA(X,Y|Z) = average of the triplet product of three gene profiles:

$LA(X,Y \mid Z) = (x_1 y_1 z_1 + x_2 y_2 z_2 + \dots + x_n y_n z_n)/n$
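The two steps on this slide can be sketched in a few lines. The slide does not say which rank-to-quantile convention is used, so the mapping rank / (n + 1) below is an assumption, as are the function names `normal_scores` and `la_score`. Ties are broken by input order, which is also an arbitrary choice.

```python
from statistics import NormalDist


def normal_scores(values):
    """Replace each value by the standard normal quantile of its rank.

    Assumed convention: the value with rank r (1..n) maps to Phi^{-1}(r / (n + 1)).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    inv_cdf = NormalDist().inv_cdf
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = inv_cdf(rank / (n + 1))
    return scores


def la_score(x, y, z):
    """Average triplet product of the normal-score-transformed profiles."""
    xs, ys, zs = normal_scores(x), normal_scores(y), normal_scores(z)
    return sum(a * b * c for a, b, c in zip(xs, ys, zs)) / len(xs)


if __name__ == "__main__":
    # Hypothetical toy profiles over six conditions
    x = [0.2, 1.5, -0.7, 2.1, -1.3, 0.9]
    y = [1.1, -0.4, 0.6, 2.5, 1.9, -0.8]
    z = [-1.0, 0.3, -0.2, 1.7, -1.6, 0.8]
    print(la_score(x, y, z))
```

Scores produced this way have mean 0 but variance slightly below 1 for small n; rescaling them to unit variance is a possible refinement that is left out here.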

11 Liquid association is not partial correlation

X, Y, Z with Z -> X and Z -> Y (causal analysis):

$X = aZ + b + e_1$
$Y = a'Z + b' + e_2$

The partial correlation of X and Y with respect to (adjusted by, given) Z is $\mathrm{corr}(e_1, e_2)$. If Z is the common cause of X and Y, so that $e_1$ and $e_2$ are uncorrelated, then the partial correlation = 0 (example: X = Coke sales, Y = eye disease incidence rate, Z = season).

Starting with a pair of positively correlated genes Y, Z (corr(Y, Z) > 0), find X to reduce the partial correlation. This procedure is very different from LA.
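For contrast with LA, the partial correlation described above can be computed by regressing X and Y on Z and correlating the residuals. The sketch below simulates the common-cause case; the coefficients, sample size, and function names are hypothetical choices for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def residuals(u, z):
    """Residuals of the least-squares line u = a*z + b."""
    mz, mu = mean(z), mean(u)
    slope = (sum((zi - mz) * (ui - mu) for zi, ui in zip(z, u))
             / sum((zi - mz) ** 2 for zi in z))
    intercept = mu - slope * mz
    return [ui - (slope * zi + intercept) for zi, ui in zip(z, u)]


def corr(u, v):
    """Pearson correlation of two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den


def partial_corr(x, y, z):
    """Correlation of X and Y after removing the linear effect of Z from both."""
    return corr(residuals(x, z), residuals(y, z))


def simulate_common_cause(n=5000, seed=1):
    """Z drives both X and Y; the noise terms e1 and e2 are independent."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x = [2.0 * zi + 1.0 + rng.gauss(0.0, 1.0) for zi in z]
    y = [-1.5 * zi + 0.5 + rng.gauss(0.0, 1.0) for zi in z]
    return x, y, z


if __name__ == "__main__":
    x, y, z = simulate_common_cause()
    print("corr(X, Y)         =", round(corr(x, y), 3))
    print("partial corr given Z =", round(partial_corr(x, y, z), 3))
```

The marginal correlation is strongly negative here, while the partial correlation is close to zero. LA asks a different question: whether the X-Y association itself changes as Z changes.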

12 Quadratic relationship

Sometimes liquid association may occur when X and Y have a quadratic trend. This is often the case when Z has good correlation with either X or Y.

For example, let $Y = X^2 + e_1$, where X is normal with mean 0 and variance 1, and $e_1$ has mean 0 and variance 1. Then Corr(X, Y) = 0. Let $Z = 0.8X + 0.6e_2$, where $e_2$ has mean 0 and variance 1, with $e_1$ and $e_2$ independent of X and of each other.

Show LA activity plot.

$E(XYZ) = 0.8\,E X^4 > 0$
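The quadratic example can be simulated directly. In this sketch X is standard normal and the noise terms are taken to be independent standard normals (the normality of the noise is an assumption; the slide only gives their mean and variance). Since $E X^4 = 3$ for a standard normal X, the target value is $0.8 \times 3 = 2.4$ on the untransformed scale. The function name `simulate_quadratic` is illustrative.

```python
import random


def simulate_quadratic(n=100_000, seed=0):
    """Estimate E(XYZ) and Corr(X, Y) for Y = X^2 + e1, Z = 0.8 X + 0.6 e2."""
    rng = random.Random(seed)
    sum_xyz = 0.0
    sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x * x + rng.gauss(0.0, 1.0)
        z = 0.8 * x + 0.6 * rng.gauss(0.0, 1.0)
        sum_xyz += x * y * z
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y
    # Sample Pearson correlation between X and Y
    cov = sum_xy / n - (sum_x / n) * (sum_y / n)
    var_x = sum_xx / n - (sum_x / n) ** 2
    var_y = sum_yy / n - (sum_y / n) ** 2
    return sum_xyz / n, cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    la, r = simulate_quadratic()
    print("E(XYZ)     ~", round(la, 2))
    print("Corr(X, Y) ~", round(r, 3))
```

The correlation stays near zero while the triple product is clearly positive, so a high LA score alone does not distinguish a genuine change in co-expression from a quadratic trend of this kind.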

13 Statistical significance

The P-value can be calculated by a permutation test or by a large-sample approximation.

Plots of liquid association are provided by two methods:
- MLE for a mixture model
- discrete method
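A permutation P-value for an LA score can be sketched as follows. The slides do not specify which profile is permuted or how the P-value is counted, so this example assumes that the scouting profile Z is shuffled and that the two-sided P-value is (1 + number of permuted scores at least as extreme) / (1 + number of permutations). The names `triple_product_mean`, `permutation_pvalue`, and `simulate_profiles` are hypothetical, and the inputs are assumed to be already transformed profiles.

```python
import random


def triple_product_mean(x, y, z):
    """Average of x_i * y_i * z_i over conditions."""
    return sum(a * b * c for a, b, c in zip(x, y, z)) / len(x)


def permutation_pvalue(x, y, z, n_perm=500, seed=0):
    """Two-sided permutation P-value obtained by shuffling z (assumed scheme)."""
    rng = random.Random(seed)
    observed = triple_product_mean(x, y, z)
    z_perm = list(z)  # copy so the caller's profile is left untouched
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(z_perm)
        if abs(triple_product_mean(x, y, z_perm)) >= abs(observed):
            exceed += 1
    # Adding 1 to both counts keeps the P-value strictly positive
    return observed, (exceed + 1) / (n_perm + 1)


def simulate_profiles(n=200, seed=1):
    """Toy data in which the X-Y association grows with Z."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [xi * zi + 0.5 * rng.gauss(0.0, 1.0) for xi, zi in zip(x, z)]
    return x, y, z


if __name__ == "__main__":
    x, y, z = simulate_profiles()
    score, p_value = permutation_pvalue(x, y, z)
    print("LA score =", round(score, 3), " P-value =", round(p_value, 4))
```

With 500 permutations the smallest attainable P-value is 1/501, which is why a genome-wide scan needs far more iterations to resolve very small P-values.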

14 Figure 3. Organization chart for incorporating LA with similarity-based methods. Coexpressed genes found by profile similarity analysis can be pooled together to obtain a consensus profile for LA-scouting. Likewise, the genes identified through the LA system can be further analyzed for patterns of clustering. For some applications, the scouting variable may come from external sources related to the expression profiles. SVD: singular value decomposition; PCA: principal component analysis.

[Figure: organization chart starting from full genome expression profiles.
- Similarity-based analysis: co-expression neighbors; hierarchical cluster; eigen profile by SVD or PCA; etc.
- LA-based analysis:
  - Finding LA-scouting genes for a given pair of genes
  - Finding LAPs for a given scouting variable Z: using a consensus profile as Z; using a gene profile as Z; using an external variable as Z
- Similarity-based analysis]

15 A website for co-mining public and in-house data

16 Data sets

Organisms:
- Primary: Homo sapiens; mouse; yeast
- Others: C. elegans; Arabidopsis; E. coli

Homo sapiens: 17 datasets (more are added now)
- 60 cell line_affy: 60 conditions, 5611 genes
- 60 cell line cdna: 60 conditions, 9706 genes
- GNF_atlas (2002): 101 conditions, genes
- GNF_atlas (2004): 158 conditions, genes
- Human eqtl (B-cell): 355 conditions, 8793 genes
- Lung cancer, 4 data sets:
  - Bhattacharjee et al.: 203 conditions, genes
  - Beer et al.: 96 conditions, 7129 genes
  - Garber et al.: 73 conditions, genes
  - Wigle et al.: 39 conditions, genes

17 Facilities

Basic
- Correlation for a pair of genes
- Liquid association for a triplet of genes

Enhancement
- Advanced search methods: gene symbols; gene locations; gene ontology; regulation (Transfac); locus link
- Compute: variations in computing LA scores
  - Liquid association (default)
  - Projective LA (for multiple genes)
  - Transformation
  - LA scouting genes
  - Correlation only
  - Raw data; normal transformation
- Clustering: k-means, hierarchical clustering, self-organizing methods (still testing)

18 Facilities (continued)

Post-LA refining tools
- Summary: counts, histogram, GO, pathway (still testing)
- Correlation
- Liquid association
- Instant link to Entrez Gene or SGD (yeast only)
- Liquid association graphs (two methods)

Save
- Info (gene annotation, from public domain): Gene_sym, Gene-Name, chrom, start, stop, etc.
- (expression data, computed): Indices, Ranks, Quantiles, Rank_LAP, Rank_Corr, Transfac
- GO term (for yeast now)

Compute
- Correlation matrix (raw or normalized)
- Clustering (k-means; hierarchical)

19 Facilities (continued)

- P*: permutation with 50,000 iterations (testing)
- P**: permutation with 1,000,000 iterations (does not work yet)
- Download (create Excel files for exporting)
- MAP (chromosome locations of output genes)
- Alert system: MS markers; MS candidate genes; yeast genetics
- User added system (talk to us)
- Disease pages (work in progress): multiple sclerosis
- Group by
- Adding genes
- Delete; modify
- Computation methods.
- Databases,

20 Special tools (under development)

- For handling marker data
- Converting to binary data

Additional links

Precomputed data
- Master LA genes (for limited datasets for now)
- Protein complex data (only in yeast for now)
- KEGG pathway
