Applications of admixture models

CM226: Machine Learning for Bioinformatics. Fall 2016
Sriram Sankararaman
Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price

Outline
Admixture models. Population structure and GWAS.

Mixture model for genetic data
$K$: the number of populations (mixture components).
$\pi_k$: mixture weights; they represent how much each population contributes to the final distribution.
$f_{m,k}$: allele frequency of SNP $m$ in each of the $K$ populations.
$$p(x, z) = p(z)\, p(x \mid z)$$

Mixture model for genetic data
Denote
$$p(z \mid \pi) = \prod_{k=1}^{K} \pi_k^{\mathbb{1}\{z = k\}}$$
Now, assume the conditional distributions are independent Binomial:
$$p\big(x \mid z = k, (f_{m,k})_{m=1}^{M}\big) = \prod_{m} p(x_m \mid z = k) = \prod_{m} \mathrm{Bin}(x_m; 2, f_{m,k})$$
Then the marginal distribution of $x$ is
$$p(x \mid \theta) = \sum_{k=1}^{K} p(z = k \mid \pi)\, p(x \mid z = k, f_k)$$
The parameters are $\theta = (\pi_k, f_k)_{k=1}^{K}$.

Mixture model for genetic data
Given $N$ individuals over $M$ SNPs, $x_n$, $n \in \{1, \ldots, N\}$, write the log likelihood $LL(\theta)$. Estimate the maximum likelihood parameters $\theta$ using EM.
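As a rough illustration of the log likelihood asked for above, the sketch below evaluates $LL(\theta) = \sum_n \log \sum_k \pi_k \prod_m \mathrm{Bin}(x_{n,m}; 2, f_{m,k})$ in plain Python. The function names and the toy data are hypothetical, and the code assumes every frequency lies strictly between 0 and 1.

```python
import math

def binom_pmf(x, f):
    """P(x | f) for a genotype x in {0, 1, 2} under Bin(2, f)."""
    return math.comb(2, x) * f**x * (1.0 - f)**(2 - x)

def log_likelihood(X, pi, F):
    """LL(theta) = sum_n log sum_k pi[k] * prod_m Bin(X[n][m]; 2, F[m][k])."""
    total = 0.0
    for x_n in X:
        # log p(z = k) + log p(x_n | z = k) for each component k
        log_terms = []
        for k, pi_k in enumerate(pi):
            lt = math.log(pi_k)
            for m, x in enumerate(x_n):
                lt += math.log(binom_pmf(x, F[m][k]))
            log_terms.append(lt)
        # log-sum-exp over components for numerical stability
        top = max(log_terms)
        total += top + math.log(sum(math.exp(t - top) for t in log_terms))
    return total

# Toy data (hypothetical): 2 individuals, 2 SNPs, K = 2 populations.
X = [[2, 0], [1, 1]]
pi = [0.5, 0.5]
F = [[0.25, 0.40], [0.57, 0.32]]   # F[m][k]: frequency of SNP m in population k
ll = log_likelihood(X, pi, F)
```

Working in log space matters once $M$ is large, because the product over SNPs underflows quickly.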

Mixture model for genetic data: Example
Supervised mixture models. [Table: allele frequency of each SNP in POP 1 and POP 2, and the genotype of individual x]
Does individual x belong to population 1 or 2?
$$P(\text{Data} \mid x \text{ is in population 1}) = (0.25)^2 (0.75)^0 (0.57)^0 (0.43)^2 \cdots$$
$$P(\text{Data} \mid x \text{ is in population 2}) = (0.40)^2 (0.60)^0 (0.32)^0 (0.68)^2 \cdots$$
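The comparison above can be sketched in a few lines of Python. Only the two SNPs whose factors appear in the formulas are used (frequencies 0.25 and 0.57 in population 1, 0.40 and 0.32 in population 2, genotypes 2 and 0), so the numbers are not the full products from the slide. The equal prior weights in the last line are an assumption.

```python
def genotype_likelihood(genotypes, freqs):
    """Product over SNPs of f^x (1 - f)^(2 - x).

    Binomial coefficients are omitted: they depend only on the genotype,
    so they are the same for both populations.
    """
    p = 1.0
    for x, f in zip(genotypes, freqs):
        p *= f**x * (1.0 - f)**(2 - x)
    return p

x = [2, 0]              # genotypes of individual x at the two SNPs shown
pop1 = [0.25, 0.57]     # allele frequencies in population 1
pop2 = [0.40, 0.32]     # allele frequencies in population 2

lik1 = genotype_likelihood(x, pop1)
lik2 = genotype_likelihood(x, pop2)
# Posterior probability of population 1, assuming equal prior weights.
posterior_pop1 = lik1 / (lik1 + lik2)
```

On these two SNPs alone, population 2 has the larger likelihood.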

Mixture models for genetic data
Unsupervised mixture models. What if the allele frequencies are not known? Use EM to infer the parameters (HW problem).
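One possible shape for such an EM procedure is sketched below with NumPy. This is not the homework solution from the course; the function name, the random initialization, the clipping constant, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def em_binomial_mixture(X, K, n_iter=100, seed=0):
    """EM for a K-component mixture of independent Bin(2, F[m, k]) genotypes.

    X: (N, M) integer array with entries in {0, 1, 2}.
    Returns weights pi (K,), frequencies F (M, K), responsibilities R (N, K).
    """
    rng = np.random.default_rng(seed)
    N, M = X.shape
    eps = 1e-6
    pi = np.full(K, 1.0 / K)
    F = rng.uniform(0.2, 0.8, size=(M, K))   # random start (arbitrary choice)
    for _ in range(n_iter):
        # E-step: log pi_k + sum_m log Bin(x_m; 2, F[m, k]), dropping the
        # binomial coefficient because it does not depend on k.
        log_r = np.log(pi) + X @ np.log(F) + (2 - X) @ np.log(1.0 - F)
        log_r -= log_r.max(axis=1, keepdims=True)
        R = np.exp(log_r)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted cluster sizes and weighted allele frequencies.
        Nk = R.sum(axis=0)
        pi = np.clip(Nk / N, eps, None)
        pi /= pi.sum()
        F = np.clip((X.T @ R) / np.maximum(2.0 * Nk, eps), eps, 1.0 - eps)
    return pi, F, R

# Simulated data (hypothetical): two populations, 50 individuals each.
rng = np.random.default_rng(1)
M = 200
F_true = np.column_stack([rng.uniform(0.05, 0.45, M), rng.uniform(0.55, 0.95, M)])
z = np.repeat([0, 1], 50)
X = rng.binomial(2, F_true[:, z].T)          # shape (100, M)
pi_hat, F_hat, R = em_binomial_mixture(X, K=2)
labels = R.argmax(axis=1)                    # hard assignment, up to label swap
```

EM only finds a local optimum, so in practice one would likely run several random restarts and keep the fit with the highest log likelihood.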

Admixture models / latent Dirichlet allocation
Clustering: each sample belongs to exactly one cluster. In genetics, cluster = population, but individuals could belong to more than one population.

Admixture models
An individual can now have fractional membership in each population. Each SNP can have a different ancestry.

Population admixture
An admixed population is one that has ancestry from multiple distinct populations.

Admixture better reflects human biology
[Figure: Hellenthal et al., Science 2014]

Examples of admixed populations
African-Americans: African and European ancestry; 10% of the US population.
Latino Americans (Hispanics): European, Native American and African ancestry; 15% of the US population. Examples: Mexican Americans, Puerto Ricans.
Hawaiians.
South Asians.
Middle Easterners.

Admixture and ancestry [figure]

PCA on genetic data [figure]

Admixture leads to variation in proportions of genome-wide ancestry [figure]

PCA on HapMap Phase 3 [figure]

Admixture model
Each individual $n$ has a parameter $g_n = (g_{n,1}, \ldots, g_{n,K})$, where $g_{n,k} \ge 0$ and $\sum_k g_{n,k} = 1$.
Each population $k$ has a vector of allele frequencies over the SNPs, $f_k = (f_{1,k}, \ldots, f_{M,k})$.
$$z_{n,m,l} \mid g_n \sim \mathrm{Mult}(g_n), \quad l \in \{1, 2\}$$
$$x_{n,m} \mid z_{n,m,1}, z_{n,m,2}, f \sim \mathrm{Ber}\big(f_{m, z_{n,m,1}}\big) + \mathrm{Ber}\big(f_{m, z_{n,m,2}}\big)$$
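To make the generative process concrete, here is a small sampling sketch in plain Python. The function name `sample_genotype` and the example values of $g_n$ and $f$ are hypothetical; the two draws in the loop correspond to the two allele copies $l \in \{1, 2\}$.

```python
import random

def sample_genotype(g_n, f_m, rng):
    """Draw x_{n,m} for one individual at one SNP.

    g_n: ancestry proportions over K populations (non-negative, sum to 1).
    f_m: allele frequency of this SNP in each of the K populations.
    Returns the genotype in {0, 1, 2} and the two sampled ancestries.
    """
    K = len(g_n)
    x = 0
    ancestries = []
    for _ in range(2):                                  # l in {1, 2}
        z = rng.choices(range(K), weights=g_n)[0]       # z_{n,m,l} ~ Mult(g_n)
        x += 1 if rng.random() < f_m[z] else 0          # Ber(f_{m,z})
        ancestries.append(z)
    return x, ancestries

# Hypothetical example: 80% ancestry from population 1, two SNPs.
rng = random.Random(0)
g = [0.8, 0.2]
F = [[0.25, 0.40], [0.57, 0.32]]      # F[m][k]
genotypes = [sample_genotype(g, f_m, rng)[0] for f_m in F]
```

Note that the ancestry is drawn separately for each allele copy at each SNP, which is what lets different SNPs of the same individual carry different ancestries.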

Inference in the admixture model
Parameters: $\theta = (g_n, f_k)$. Use EM to estimate the parameters.
E-step: compute $r^{(t)}_{n,m,a,b} = p\big(z_{n,m} = (a, b) \mid x_{n,m}, g^{(t)}_n, f^{(t)}_{m,k}\big)$.
M-step: update the estimates of the parameters. Work out the updates!
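A minimal sketch of the E-step for a single individual and SNP is given below. It assumes, following the admixture model, that the two ancestries are drawn independently from $g_n$, so the posterior is proportional to $g_{n,a}\, g_{n,b}\, p(x_{n,m} \mid f_{m,a}, f_{m,b})$. The function names are hypothetical, and the M-step is left as the exercise it is in the slides.

```python
def genotype_prob(x, fa, fb):
    """p(x | z = (a, b)): sum of two independent Bernoulli draws with
    success probabilities fa and fb."""
    if x == 0:
        return (1 - fa) * (1 - fb)
    if x == 1:
        return fa * (1 - fb) + (1 - fa) * fb
    return fa * fb

def e_step_single(x, g_n, f_m):
    """Responsibilities r[a][b] = p(z_{n,m} = (a, b) | x, g_n, f_m)."""
    K = len(g_n)
    # Unnormalized posterior: prior g_a * g_b times the genotype likelihood.
    r = [[g_n[a] * g_n[b] * genotype_prob(x, f_m[a], f_m[b]) for b in range(K)]
         for a in range(K)]
    total = sum(sum(row) for row in r)
    return [[v / total for v in row] for row in r]

# Hypothetical example: genotype 2 at a SNP with frequencies 0.25 and 0.40.
r = e_step_single(2, [0.5, 0.5], [0.25, 0.40])
```

The responsibilities form a $K \times K$ table per individual and SNP, so a full E-step costs on the order of $N M K^2$ operations.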

Admixture model for genetic data: Example
Supervised admixture models. [Table: allele frequency of each SNP in POP 1 and POP 2, and the genotype of individual x]
Individual x has ancestry $\alpha$ from population 1 and $(1 - \alpha)$ from population 2. Find $\alpha$.
$$P(\text{Data} \mid \alpha) = [0.25\alpha + 0.40(1-\alpha)]^2\, [(1-0.25)\alpha + (1-0.40)(1-\alpha)]^0\, [0.57\alpha + 0.32(1-\alpha)]^0\, [(1-0.57)\alpha + (1-0.32)(1-\alpha)]^2 \cdots$$
Find the value of $\alpha$ at which $P$ attains its maximum.
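The maximization can be sketched as a simple grid search over $\alpha$. The code below uses only the two SNPs whose factors are visible in the formula, so its answer is not the value from the slide, which depends on the remaining SNPs. The grid resolution and function name are assumptions.

```python
def admixture_likelihood(alpha, genotypes, pop1, pop2):
    """P(data | alpha) = prod_m q_m^x_m (1 - q_m)^(2 - x_m),
    where q_m = alpha * f_{m,1} + (1 - alpha) * f_{m,2}.
    Binomial coefficients are dropped because they do not depend on alpha."""
    p = 1.0
    for x, f1, f2 in zip(genotypes, pop1, pop2):
        q = alpha * f1 + (1.0 - alpha) * f2
        p *= q**x * (1.0 - q)**(2 - x)
    return p

x = [2, 0]              # genotypes at the two SNPs shown
pop1 = [0.25, 0.57]     # allele frequencies in population 1
pop2 = [0.40, 0.32]     # allele frequencies in population 2

grid = [i / 100 for i in range(101)]    # alpha in {0.00, 0.01, ..., 1.00}
alpha_hat = max(grid, key=lambda a: admixture_likelihood(a, x, pop1, pop2))
```

With just these two SNPs both factors decrease in $\alpha$, so the grid search returns $\alpha = 0$; with many SNPs the maximum is typically in the interior, and a one-dimensional optimizer or EM could replace the grid.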

Applying admixture models to HGDP
Human Genome Diversity Project. [Figure: Li et al., Science 2008]

Admixture models outside of genetics
Also known as topic models or LDA (latent Dirichlet allocation). Used to model topics in documents.
Genotypes = words. Individual = document. Population = topic.
Each document has a different distribution over topics. Each topic specifies a distribution over words.

Admixture models outside of genetics
[Figure: Griffiths and Steyvers, PNAS 2004]

Outline
Admixture models. Population structure and GWAS.

Population structure can lead to false discoveries [figure]

Approaches to deal with population stratification
Structured association: cluster individuals into populations, do GWAS in each population, and combine the results.
Principal components: include the principal components in the model.

Example
$n = 200$, $m = 1000$, $Z_n \in \{1, 2\}$, with $Z_n = 1$ for $n \le 100$ and $Z_n = 2$ for $n > 100$.
$$Y_n \mid Z_n \sim \begin{cases} \mathcal{N}(10, 1), & Z_n = 1 \\ \mathcal{N}(0, 1), & Z_n = 2 \end{cases}$$
$$X_{n,m} \mid Z_n \sim \mathrm{Ber}(f_{Z_n, m})$$

How well does the model fit?
True ancestry $Z$ known. [figure]

How well does the model fit?
True ancestry $Z$ unknown. We find 222 SNPs that are statistically significant (p-value < 0.05/1000).
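The effect described above can be reproduced in spirit with a short simulation. The sketch below follows the example setup ($n = 200$, $m = 1000$, phenotype mean 10 versus 0), but the allele frequencies are not given in the slides, so drawing them uniformly and independently per population is an assumption, as is the normal approximation used for the p-values. The count it produces will therefore differ from 222.

```python
import math
import numpy as np

def assoc_pvalues(X, y):
    """Two-sided p-values for the marginal association of each SNP with y.

    Uses the t statistic of a simple linear regression, with a normal
    approximation to its null distribution.
    """
    n = len(y)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.sqrt((Xc**2).sum(axis=0)) * np.sqrt((yc**2).sum()) + 1e-12)
    t = r * np.sqrt((n - 2) / (1.0 - r**2 + 1e-12))
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in t])

rng = np.random.default_rng(0)
n, m = 200, 1000
Z = np.repeat([0, 1], n // 2)                      # populations 1 and 2, coded 0 and 1
f = rng.uniform(0.1, 0.9, size=(2, m))             # assumed frequencies f[Z, m]
X = (rng.random((n, m)) < f[Z]).astype(float)      # X[n, m] ~ Ber(f[Z_n, m])
Y = rng.normal(np.where(Z == 0, 10.0, 0.0), 1.0)   # N(10, 1) or N(0, 1)

pvals = assoc_pvalues(X, Y)
n_significant = int((pvals < 0.05 / m).sum())      # Bonferroni threshold
```

No SNP influences the phenotype in this simulation: $Y$ depends only on $Z$, so every significant SNP is a false discovery caused by population structure.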

How well does the model fit?
Visualize these associations. [figure]

How well does the model fit?
Visualize these associations in each population. [figure]

How well does the model fit?
Infer PCs: PC scores for the first PC. [figure]

How well does the model fit?
Infer PCs: PC1 vs PC2. [figure]

How well does the model fit?
Fraction of variance explained: about 6% of the variance is explained by PC1.

How well does the model fit?
Correct for PCs: no association is significant!
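A possible way to carry out the PC correction is sketched below: compute PC1 from the centered genotype matrix with an SVD, then test each SNP after regressing PC1 (and an intercept) out of both the phenotype and the SNP. The simulated frequencies, the helper name `pvalues_adjusted`, and the normal approximation for p-values are assumptions, so the counts and the variance fraction need not match the slides.

```python
import math
import numpy as np

def pvalues_adjusted(X, y, covariates):
    """P-values for each SNP after adjusting for covariates plus an intercept.

    Covariates are projected out of both y and every SNP; the partial
    correlation is then converted to a t statistic (normal approximation).
    """
    n = len(y)
    C = np.column_stack([np.ones(n), covariates])
    Q, _ = np.linalg.qr(C)                  # orthonormal basis of covariate space
    y_res = y - Q @ (Q.T @ y)
    X_res = X - Q @ (Q.T @ X)
    dof = n - C.shape[1] - 1
    r = (X_res.T @ y_res) / (np.linalg.norm(X_res, axis=0) * np.linalg.norm(y_res) + 1e-12)
    t = r * np.sqrt(dof / (1.0 - r**2 + 1e-12))
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in t])

rng = np.random.default_rng(0)
n, m = 200, 1000
Z = np.repeat([0, 1], n // 2)
f = rng.uniform(0.1, 0.9, size=(2, m))             # assumed frequencies
X = (rng.random((n, m)) < f[Z]).astype(float)
Y = rng.normal(np.where(Z == 0, 10.0, 0.0), 1.0)

# PCA via SVD of the column-centered genotype matrix.
U, S, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
pc1 = U[:, 0] * S[0]                               # PC1 scores
var_explained = S[0]**2 / (S**2).sum()             # fraction of variance on PC1

thresh = 0.05 / m
n_naive = int((pvalues_adjusted(X, Y, np.empty((n, 0))) < thresh).sum())
n_corrected = int((pvalues_adjusted(X, Y, pc1) < thresh).sum())
```

Because PC1 tracks population membership here, adjusting for it removes most of the spurious signal, which mirrors the conclusion on the slide.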

Summary
PCA is an example of a latent variable model with a continuous latent variable, unlike clustering, where the latent variable is discrete. There is a probabilistic model corresponding to PCA.
Admixture models (also called topic models or LDA) are generalizations of clustering.
Applications: inferring ancestry and correcting for population structure.
Question: When do we include PCs in our regression?
