Estimating Local Ancestry in admixed Populations (LAMP)


1 Estimating Local Ancestry in admixed Populations (LAMP) QIAN ZHANG 572 6/05/2014

2 Outline
1) Sketch Method
2) Algorithm
3) Simulated Data: Accuracy Varying
- Pop1-Pop2 ancestries
- r^2 pruning threshold
- Number of SNPs to be analyzed

3 In one person's genome: ordered genotype 01 at this SNP (possibilities: 00, 01, 10, 11).
Figure: alleles on Chr (Mom) and Chr (Dad) by position.
1. Sketch Method

4 Figure: chromosomes of an admixed person (e.g. African American), with a legend marking each chromosome segment as from Population 1 (e.g. Africans) or from Population 2 (e.g. Europeans). LAMP estimates the fraction of Pop 2 ancestry at each position.

5 In each window, for each person, say that the person has 0, 1, or 2 alleles from Pop 2.
For each SNP position, take a majority vote over the windows covering it to get the number of Pop 2 alleles.
Figure: a window of L SNPs slides along the chromosome (windows i = 1, 2, 3), each with its own call; the example SNP is called as 1 allele from Pop 2.
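
The window-then-vote idea on this slide can be sketched in Python. This is a minimal illustration rather than LAMP's own code: the function name `majority_vote`, the half-open `(start, end)` window layout, and the tie-breaking rule (the smallest call wins) are all assumptions made for the example.

```python
from collections import Counter

def majority_vote(n_snps, windows, calls):
    """Combine per-window ancestry calls into one call per SNP.

    windows: list of (start, end) SNP index ranges, end exclusive (assumed layout).
    calls:   per-window calls (0, 1, or 2 Pop 2 alleles) for one person.
    Returns one call per SNP, or None where no window covers the SNP.
    """
    votes = [Counter() for _ in range(n_snps)]
    for (start, end), call in zip(windows, calls):
        for j in range(start, min(end, n_snps)):
            votes[j][call] += 1
    result = []
    for counter in votes:
        if not counter:
            result.append(None)
            continue
        best = max(counter.values())
        # Tie-break (assumption): smallest call among the most frequent ones.
        result.append(min(c for c, v in counter.items() if v == best))
    return result

# Three overlapping windows of L = 4 SNPs, each shifted by 2 SNPs.
windows = [(0, 4), (2, 6), (4, 8)]
calls = [0, 1, 1]
per_snp = majority_vote(8, windows, calls)
```

With only two windows covering a SNP the vote can tie, which is why the tie-breaking rule has to be chosen explicitly.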

6 INPUTS
- SNP positions
- Unphased genotypes (0/0, 0/1, or 1/1) of admixed people
- K = 2 populations
- Generations g = 7 of admixing
- Fractions (α_1, α_2) = (0.8, 0.2) of all genomes in the sample from Pop 1, Pop 2
- r: Recombination rate (# of breakpoints per meiosis per generation)
- Offset: Fraction of the window's bp length by which to shift the window
OUTPUT
- Fraction of Pop 2 alleles at each SNP position in each person

7 Outline
1) Sketch Method
2) Algorithm
3) Simulated Data: Accuracy Varying
- Pop1-Pop2 ancestries
- r^2 pruning threshold
- Number of SNPs to be analyzed
2. Algorithm

8 1) Choose the window length L
- Want big: enough SNPs to tell apart ancestries
- Want small: no breakpoint in the window
- Error measure: expected fraction of bp positions in a window of 2 chromosomes that is incorrectly called due to a breakpoint.
Figure: call vs. truth for a window containing a breakpoint; fraction = 1/2 here.
Pick the largest L such that ε bounds the expected fraction of errors in a window:
$$L \le \frac{\epsilon}{(g - 1)\, r\, \alpha_1 \alpha_2}$$
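
As a rough numerical illustration, the bound on this slide can be evaluated directly. The sketch below reads the slide's formula as L ≤ ε / ((g - 1) r α_1 α_2), which is an assumption about the garbled expression. The tolerance `eps = 0.01` is a hypothetical value, and r is assumed to be a per-bp rate so that L comes out in bp; g, r, and the α's are the values quoted elsewhere in these slides.

```python
def max_window_length(eps, g, r, alpha1, alpha2):
    """Largest L with (g - 1) * r * alpha1 * alpha2 * L <= eps (assumed bound)."""
    if g <= 1:
        raise ValueError("need g > 1 generations for breakpoints to occur")
    return eps / ((g - 1) * r * alpha1 * alpha2)

# eps is a hypothetical error tolerance; the other values come from the slides.
L_bp = max_window_length(eps=0.01, g=7, r=1e-8, alpha1=0.8, alpha2=0.2)
```

Under this reading, more generations of admixture or a higher recombination rate shrink the window, since breakpoints become more likely inside it.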

9 For each window:
2) Prune SNPs to get independence
- Plink: window of 50, slide 5, remove SNP if r^2 > 0.1 (what I did)
- Greedy removal of SNPs with r^2 > 0.1 (paper)
3) MAXVAR algorithm to get initial ancestry estimates 0, 1, or 2 for a window
4) ICM iterates Eqns (1) and (2), with EM to solve Eqn (2)
Get a window's call 0, 1, or 2, for the number of Pop 2 alleles for each person.
For person i, $\theta^{(i)} = (\theta_1^{(i)}, \theta_2^{(i)}) \in \{1, 2\} \times \{1, 2\}$, like an unordered genotype:
θ^(i)            # of Pop 2 alleles
(1,1)            0
(1,2) = (2,1)    1
(2,2)            2
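
One way to picture the greedy pruning in step 2 is the NumPy sketch below. It is not Plink's sliding-window procedure and not necessarily the paper's exact rule: the function name `greedy_prune` and the choice to compare each SNP against every SNP kept so far are assumptions for illustration.

```python
import numpy as np

def greedy_prune(genotypes, r2_threshold=0.1):
    """Return indices of SNPs kept after greedy r^2 pruning.

    genotypes: (m individuals, n SNPs) array of minor-allele counts 0/1/2.
    A SNP is kept only if its squared correlation with every SNP kept so far
    is at most r2_threshold. Monomorphic SNPs (zero variance) are dropped.
    """
    G = np.asarray(genotypes, dtype=float)
    centered = G - G.mean(axis=0)
    norms = np.sqrt((centered ** 2).sum(axis=0))
    kept = []
    for j in range(G.shape[1]):
        if norms[j] == 0:
            continue
        ok = True
        for k in kept:
            r = centered[:, j] @ centered[:, k] / (norms[j] * norms[k])
            if r * r > r2_threshold:
                ok = False
                break
        if ok:
            kept.append(j)
    return kept

# Toy data: SNP 1 duplicates SNP 0, SNP 2 is uncorrelated, SNP 3 is monomorphic.
G = np.array([[0, 0, 2, 1],
              [1, 1, 0, 1],
              [2, 2, 1, 1],
              [0, 0, 0, 1],
              [1, 1, 2, 1]])
kept = greedy_prune(G, r2_threshold=0.1)
```

Because the result depends on SNP order and on which SNPs are compared, two pruning procedures with the same r^2 threshold can keep different numbers of SNPs.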

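10 If we know:
- MAFs $f_1, \dots, f_K$ of the K ancestral populations, each $f_k = (f_{k1}, \dots, f_{kn})$
- Genotypes $G_1, \dots, G_m$ of m individuals at n SNPs
Then estimate individual i's ancestries $\theta^{(i)} = (\theta_1^{(i)}, \theta_2^{(i)}) \in \{1, \dots, K\} \times \{1, \dots, K\}$ in the window as
$$\hat\theta^{(i)} = \arg\max_{\theta^{(i)}} P(\theta^{(i)} \mid f_1, \dots, f_K, G_i) \quad (1)\ \text{E-Step}$$
$$\phantom{\hat\theta^{(i)}} = \arg\max_{\theta^{(i)}} P(G_i \mid f_1, \dots, f_K, \theta^{(i)})\, P(\theta^{(i)} \mid f_1, \dots, f_K)$$
This takes the mode, not the mean, so it is called ICM, not EM.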

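11
$$\hat\theta^{(i)} = \arg\max_{\theta^{(i)}} P(\theta^{(i)} \mid f_1, \dots, f_K, G_i) \quad (1)\ \text{E-Step}$$
$$\phantom{\hat\theta^{(i)}} = \arg\max_{\theta^{(i)}} P(G_i \mid f_1, \dots, f_K, \theta^{(i)})\, P(\theta^{(i)} \mid f_1, \dots, f_K)$$
For all s, t:
$$P(\theta^{(i)} = \{s, t\} \mid f_1, \dots, f_K) = \alpha_s \alpha_t\, 1[s = t] + 2 \alpha_s \alpha_t\, 1[s \ne t]$$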
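
Equation (1) together with the prior on this slide can be sketched as a search over unordered ancestry pairs. The code below is an illustrative sketch, not the LAMP implementation. It assumes SNPs in the window are independent after pruning, and it uses the standard unphased genotype likelihood (heterozygote probability f_s(1 - f_t) + (1 - f_s)f_t). The names `genotype_prob` and `map_ancestry` are hypothetical, populations are indexed from 0, and all α's are assumed positive.

```python
import itertools
import math

def genotype_prob(g, fs, ft):
    """P(unordered genotype g | one allele from MAF fs, the other from MAF ft)."""
    if g == 0:
        return (1 - fs) * (1 - ft)
    if g == 1:
        return fs * (1 - ft) + (1 - fs) * ft
    return fs * ft

def map_ancestry(genotype, mafs, alphas):
    """Return the unordered pair (s, t), 0-based, maximizing Equation (1).

    genotype: minor-allele counts (0/1/2) for one person at the window's SNPs.
    mafs:     mafs[k][j] = minor allele frequency of population k at SNP j.
    alphas:   prior ancestry fractions, one per population (assumed > 0).
    """
    K = len(alphas)
    best_pair, best_score = None, -math.inf
    for s, t in itertools.combinations_with_replacement(range(K), 2):
        prior = alphas[s] * alphas[t] * (1 if s == t else 2)
        score = math.log(prior)
        for j, g in enumerate(genotype):
            p = genotype_prob(g, mafs[s][j], mafs[t][j])
            score += math.log(p) if p > 0 else -math.inf
        if score > best_score:
            best_pair, best_score = (s, t), score
    return best_pair

# Hypothetical MAFs for two well-separated populations at four SNPs.
mafs = [[0.9, 0.8, 0.9, 0.7], [0.1, 0.2, 0.1, 0.3]]
alphas = (0.8, 0.2)
call_hom_minor = map_ancestry([2, 2, 2, 2], mafs, alphas)
call_het = map_ancestry([1, 1, 1, 1], mafs, alphas)
call_hom_major = map_ancestry([0, 0, 0, 0], mafs, alphas)
```

Working in log space avoids underflow when a window contains many SNPs.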

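12 If we know:
- Individual i's ancestries $\theta^{(i)} = (\theta_1^{(i)}, \theta_2^{(i)}) \in \{1, \dots, K\} \times \{1, \dots, K\}$
- Genotypes $G_1, \dots, G_m$ of m individuals at n SNPs
Then estimate the ancestral MAFs $f_1, \dots, f_K$ as
$$\arg\max_{f_1, \dots, f_K} \prod_{i=1}^{m} P(G_i \mid f_1, \dots, f_K, \theta^{(i)}) \quad (2)$$
2. How LAMP Works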

13 If the genotype $G_{ij}$ is ordered, then the MLEs of $f_1, \dots, f_K$ that maximize Equation (2) come from counting:
- Count the minor alleles from each population.
- Divide by the number of alleles from that population.
Figure: example with individuals i as rows and SNPs j as columns, giving MLE of $f_1 = (1/3, 1/2, 1/2, 0)$ and MLE of $f_2 = (1, 1/2, 1, 1/3)$.
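
The counting rule for ordered genotypes can be written out in a few lines of NumPy. This is a sketch under assumed data layouts: the function name `maf_mle_phased`, the 0-based population labels, and the toy arrays are all hypothetical, and the toy data is not the example shown in the slide's figure.

```python
import numpy as np

def maf_mle_phased(alleles, ancestry, K):
    """Counting MLE of minor allele frequencies from phased data.

    alleles:  (chromosomes, SNPs) array, 1 = minor allele, 0 = major allele.
    ancestry: same shape, entry k in {0, ..., K-1} = population of that allele.
    Returns a (K, SNPs) array; entries are NaN where a population has no alleles.
    """
    alleles = np.asarray(alleles)
    ancestry = np.asarray(ancestry)
    f = np.full((K, alleles.shape[1]), np.nan)
    for k in range(K):
        from_k = ancestry == k
        total = from_k.sum(axis=0)              # alleles from population k
        minor = (alleles * from_k).sum(axis=0)  # minor alleles among them
        np.divide(minor, total, out=f[k], where=total > 0)
    return f

# Toy data: 4 chromosomes, 2 SNPs.
alleles = [[1, 0], [0, 1], [1, 1], [0, 0]]
ancestry = [[0, 0], [0, 1], [1, 1], [0, 1]]
f_hat = maf_mle_phased(alleles, ancestry, K=2)
```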

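14 If the genotype is unordered, we cannot simply count to solve Equation (2), because we observe the unordered genotype, not the ordered one. For genotype 0/1 with ancestry 1/2: should the minor allele 1 be counted toward Pop 1 or Pop 2?
$\lambda_j^{(i)}$ is a K-long vector with 1 in the s-th entry if the minor allele gets assigned to Pop s, and
$$P(\lambda_j^{(i)} = e_{\theta_1^{(i)}} \mid f_{1j}, \dots, f_{Kj}, \theta^{(i)}) = f_{\theta_1^{(i)} j}\,(1 - f_{\theta_2^{(i)} j})$$
$$P(\lambda_j^{(i)} = e_{\theta_2^{(i)}} \mid f_{1j}, \dots, f_{Kj}, \theta^{(i)}) = f_{\theta_2^{(i)} j}\,(1 - f_{\theta_1^{(i)} j})$$
$$P(\lambda_j^{(i)} \notin \{e_{\theta_1^{(i)}}, e_{\theta_2^{(i)}}\} \mid f_{1j}, \dots, f_{Kj}, \theta^{(i)}) = 0$$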

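15 If genotypes are unordered, write Equation (2) in terms of $\lambda_j^{(i)}$ and solve via EM:
$$\arg\max_{f_1, \dots, f_K} \prod_{i=1}^{m} \Big( \prod_{j\ \text{not amb}} P(G_{ij} \mid f_1, \dots, f_K, \theta^{(i)}) \Big) \Big( \prod_{j\ \text{amb}} P(G_{ij} \mid f_1, \dots, f_K, \theta^{(i)}) \Big) \quad (2)$$
For ambiguous j,
$$P(G_{ij} \mid f_1, \dots, f_K, \theta^{(i)}) = P(\lambda_j^{(i)} = e_{\theta_1^{(i)}} \mid f_{1j}, \dots, f_{Kj}, \theta^{(i)}) + P(\lambda_j^{(i)} = e_{\theta_2^{(i)}} \mid f_{1j}, \dots, f_{Kj}, \theta^{(i)})$$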

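16 Let $\lambda_{j,s}^{(i)}$ = the s-th coordinate of $\lambda_j^{(i)}$, indicating whether the minor allele at position j in individual i gets assigned ancestry s.
Start $\theta^{(i)}$ from MAXVAR.
E-Step:
$$\hat\lambda_{j,s}^{(i)} = E\big[\lambda_{j,s}^{(i)} \mid f_{\theta_1^{(i)} j}, f_{\theta_2^{(i)} j}, \theta^{(i)}, G_{ij} = 1\big] = \begin{cases} \dfrac{f_{\theta_1^{(i)} j}\,(1 - f_{\theta_2^{(i)} j})}{f_{\theta_1^{(i)} j}\,(1 - f_{\theta_2^{(i)} j}) + (1 - f_{\theta_1^{(i)} j})\, f_{\theta_2^{(i)} j}}, & \text{if } s = \theta_1^{(i)} \\[2ex] \dfrac{f_{\theta_2^{(i)} j}\,(1 - f_{\theta_1^{(i)} j})}{f_{\theta_1^{(i)} j}\,(1 - f_{\theta_2^{(i)} j}) + (1 - f_{\theta_1^{(i)} j})\, f_{\theta_2^{(i)} j}}, & \text{if } s = \theta_2^{(i)} \end{cases}$$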
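
The E-step on this slide is a two-way normalization, which the short sketch below spells out. The function name `e_step_lambda` is hypothetical, and the MAF values in the example are made up for illustration.

```python
def e_step_lambda(f_theta1, f_theta2):
    """Expected assignment of the minor allele at a heterozygous (G_ij = 1) site.

    f_theta1, f_theta2: MAFs at SNP j of the populations of the two alleles.
    Returns (weight for s = theta_1, weight for s = theta_2); they sum to 1.
    """
    a = f_theta1 * (1 - f_theta2)   # minor allele came from population theta_1
    b = (1 - f_theta1) * f_theta2   # minor allele came from population theta_2
    total = a + b
    if total == 0:
        raise ValueError("heterozygous genotype has probability 0 under these MAFs")
    return a / total, b / total

# Hypothetical MAFs: the minor allele is common in theta_1's population.
lam1, lam2 = e_step_lambda(0.8, 0.2)
```

When the two populations have equal MAF at the SNP, the weights are 1/2 each, so the site carries no information about which population contributed the minor allele.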

17 M-Step:
$$f_{sj} = \frac{2 n^{sj}_{2,2} + n^{sj}_{2,1} + n^{sj}_{1,2} + \sum_{i:\ j\ \text{ambiguous}} \hat\lambda_{j,s}^{(i)}}{2 n^{sj}_{2,2} + 2 n^{sj}_{2,1} + 2 n^{sj}_{2,0} + n^{sj}_{1,2} + n^{sj}_{1,1} + n^{sj}_{1,0}}$$
- $n^{sj}_{k,u}$ = number of individuals with $u \in \{0, 1, 2\}$ minor alleles and $k \in \{1, 2\}$ copies of alleles from population s at site j
- Iterate EM to get MLEs of $f_1, \dots, f_K$ solving Eq (2)
- Plug the MLEs of $f_1, \dots, f_K$ into Eq (1)
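
The M-step ratio can be read as "expected minor alleles from population s, divided by alleles from population s". The sketch below computes it per individual instead of through the counts n_{k,u}, which gives the same sums. It assumes K = 2, so that an individual with one copy from population s and one minor allele is the ambiguous case. The function name `m_step_maf` and the toy inputs are hypothetical.

```python
import numpy as np

def m_step_maf(copies, genotypes, lam):
    """M-step update of f_sj for one population s at one site j.

    copies:    per individual, number of alleles (0, 1, 2) from population s.
    genotypes: per individual, number of minor alleles (0, 1, 2).
    lam:       per individual, E-step weight that the minor allele belongs to s;
               only read where copies == 1 and genotypes == 1 (ambiguous case).
    """
    copies = np.asarray(copies)
    genotypes = np.asarray(genotypes)
    lam = np.asarray(lam, dtype=float)
    minor = np.zeros(len(copies))
    both = copies == 2
    minor[both] = genotypes[both]                 # 2*n_{2,2} + n_{2,1}
    minor[(copies == 1) & (genotypes == 2)] = 1   # n_{1,2}
    ambiguous = (copies == 1) & (genotypes == 1)
    minor[ambiguous] = lam[ambiguous]             # sum of weights over ambiguous
    total = copies.sum()                          # 2*n_{2,.} + n_{1,.}
    return minor.sum() / total if total > 0 else float("nan")

# Toy site: five individuals, the fourth one is ambiguous with weight 0.5.
f_sj = m_step_maf(copies=[2, 2, 1, 1, 0],
                  genotypes=[2, 1, 2, 1, 2],
                  lam=[0, 0, 0, 0.5, 0])
```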

18 Outline
1) Sketch Method
2) Algorithm
3) Simulated Data: Accuracy Varying
- Pop1-Pop2 ancestries
- r^2 pruning threshold
- Number of SNPs to be analyzed

19 Simulate Data
- Chr 1 SNP genotypes from HapMap
- YRI: African. CEU: European. JPT: Japanese. CHB: Chinese.
- Repeat for g = 7 generations:
  - Let (α_1, α_2) = (0.8, 0.2)
  - {α_1 m people from Pop 1, α_2 m people from Pop 2} = combined set
  - Create child Chr 1: randomly pick parents from the combined set, generate recombinations with r = 10^-8, randomly transmit a chromosome from each parent
  - Create children until there are m children
3. Simulated Data: Accuracy Varying
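
A stripped-down version of this simulation, tracking only true ancestry labels and not HapMap alleles, might look like the sketch below. Several choices are assumptions, not details from the slides: founders form generation 0 and each later generation mates within the previous one, r is taken as 1e-8 per bp, the chromosome length of 2.5e8 bp and the uniform SNP map are placeholders, the two parents are distinct, and breakpoint counts are Poisson. The function names are hypothetical.

```python
import numpy as np

def simulate_ancestry(m, n_snps, alphas=(0.8, 0.2), g=7, r=1e-8,
                      chrom_len_bp=2.5e8, seed=0):
    """Simulate true local ancestry for m admixed people.

    Returns an (m, 2, n_snps) array of population labels (1 or 2), one row per
    chromosome copy. Only ancestry labels are tracked; alleles are not simulated.
    """
    rng = np.random.default_rng(seed)
    positions = np.sort(rng.uniform(0, chrom_len_bp, n_snps))  # assumed SNP map
    n1 = int(round(alphas[0] * m))
    current = np.ones((m, 2, n_snps), dtype=int)
    current[n1:] = 2                      # remaining founders come from Pop 2
    for _ in range(g):
        children = np.empty_like(current)
        for c in range(m):
            parents = rng.choice(m, size=2, replace=False)
            for copy, p in enumerate(parents):
                children[c, copy] = gamete(current[p], positions, r,
                                           chrom_len_bp, rng)
        current = children
    return current

def gamete(parent, positions, r, chrom_len_bp, rng):
    """Recombine a parent's two chromosomes and transmit one product."""
    n_breaks = rng.poisson(r * chrom_len_bp)
    breaks = np.sort(rng.uniform(0, chrom_len_bp, n_breaks))
    # The number of breakpoints left of each SNP decides which copy is read.
    switches = np.searchsorted(breaks, positions)
    source = (rng.integers(2) + switches) % 2
    return parent[source, np.arange(len(positions))]

truth = simulate_ancestry(m=10, n_snps=50)
# Number of Pop 2 alleles per person and SNP (0, 1, or 2).
pop2_alleles = (truth == 2).sum(axis=1)
```

Keeping the true labels is what makes accuracy measurable later: the estimated 0/1/2 calls can be compared entry by entry with `pop2_alleles`.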

20 Analyze the bottom generation of admixed data with LAMP, given different parameters (Pop1-Pop2, r^2 pruning threshold, number n of SNPs).
Under each setting, obtain a matrix of ancestry estimates: rows are individuals i, columns are SNPs j, and each entry is 0, 1, or 2 for the number of alleles from Pop 2.
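
Given such a matrix of estimates and the matching matrix of true values, accuracy can be summarized as below. The slides do not spell out their accuracy definition, so treating a call as correct only on an exact 0/1/2 match is an assumption. The function name `accuracy_summary` and the toy matrices are hypothetical.

```python
import numpy as np

def accuracy_summary(estimated, truth):
    """Overall and per-person accuracy of ancestry calls.

    estimated, truth: (individuals, SNPs) arrays with entries 0, 1, or 2
    (number of Pop 2 alleles). A call counts as correct only on exact match.
    """
    estimated = np.asarray(estimated)
    truth = np.asarray(truth)
    if estimated.shape != truth.shape:
        raise ValueError("estimated and truth must have the same shape")
    correct = estimated == truth
    # Overall: mean over all entries. Per person: mean over SNPs.
    return correct.mean(), correct.mean(axis=1)

est = np.array([[0, 1, 1, 2], [2, 2, 1, 0]])
tru = np.array([[0, 1, 2, 2], [2, 2, 1, 1]])
overall, per_person = accuracy_summary(est, tru)
```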

21 Vary Pop 1-Pop 2
Table columns: Setting; Pop1-Pop2; Number pruned SNPs (my LAMP); Number pruned SNPs (real LAMP); Overall accuracy (my LAMP); Overall accuracy (real LAMP)
Table rows: yri-ceu (setting 1), ceu-jpt, jpt-chb
Common across settings: r^2 = 0.1, m = 10 people, n = 4000 SNPs.
- Real LAMP: Accuracy goes down. It is harder to distinguish similar populations.
- My LAMP: Accuracy stays about the same.
- Real LAMP is doing much worse than the paper's > 90% accuracies:
  - Paper used m = 500 => estimates the f's better
  - Despite same r^2, different numbers of SNPs pruned in

22 Accuracy by Person (for YRI-CEU)
Figure: accuracy (mean over SNPs) against individual ID (1-10). Red: My LAMP. Black: Real LAMP.

23 Accuracy by Person (for CEU-JPT)
Figure: accuracy (mean over SNPs) against individual ID (1-10). Red: My LAMP. Black: Real LAMP.

24 Accuracy by Person (for JPT-CHB)
Figure: accuracy (mean over SNPs) against individual ID (1-10). Red: My LAMP. Black: Real LAMP.

25 Vary r^2 pruning threshold
Table columns: Setting; r^2; Number pruned SNPs (my LAMP); Overall accuracy (my LAMP)
Common to settings: P1-P2 = YRI-CEU, m = 10 people, n = 4000 SNPs
- Accuracy goes down very slightly
- Consistent with the paper's slight change in accuracy when varying r^2 (m = 500, n = 38,864)

26 Vary number n of SNPs
Table columns: Setting; n SNPs; Number pruned SNPs (my LAMP); Overall accuracy (my LAMP)
Common to settings: P1-P2 = YRI-CEU, m = 10 people, r^2 =
- Reduce n => lower accuracy
- Intuition: n = 4000 SNPs has 9 windows, while n = 1000 SNPs has 1 window, so there is no majority vote

27 Questions?

28 Similarity score between individuals $i_1$ and $i_2$:
$$S(i_1, i_2) = \sum_{j=1}^{n} \frac{(G_{i_1,j} - u_j)(G_{i_2,j} - u_j)}{\sigma_j^2}$$
which is like a sample correlation($G_{i_1}$, $G_{i_2}$). Higher score, more similar.
- $G_{i,j}$ is the genotype (# of minor alleles) at position j in individual i.
- $u_j = \frac{1}{m} \sum_{i=1}^{m} G_{ij}$ is the genotype at position j averaged over the m individuals i.
- $\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} (G_{ij} - u_j)^2$ is the variance in genotype.
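
The similarity score is a product of standardized genotype matrices, so all pairs can be computed at once. In the sketch below, the function name `similarity_matrix` is hypothetical, and skipping zero-variance SNPs (where the score would divide by zero) is an assumption not stated on the slide.

```python
import numpy as np

def similarity_matrix(genotypes):
    """S[i1, i2] = sum_j (G[i1,j] - u_j) * (G[i2,j] - u_j) / sigma_j^2.

    genotypes: (m individuals, n SNPs) array of minor-allele counts.
    SNPs with zero variance are skipped (assumption: they carry no signal).
    """
    G = np.asarray(genotypes, dtype=float)
    u = G.mean(axis=0)
    var = ((G - u) ** 2).mean(axis=0)      # divides by m, as on the slide
    keep = var > 0
    Z = (G[:, keep] - u[keep]) / np.sqrt(var[keep])
    return Z @ Z.T

# Toy data: individuals 0 and 2 are identical, as are 1 and 3.
G = np.array([[0, 2], [2, 0], [0, 2], [2, 0]])
S = similarity_matrix(G)
```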

29 A) For each individual i,
$$\mathrm{Var}(i) = \sum_{i' :\, i' \ne i} S(i, i')^2$$
measures how genetically similar i is to everyone else (the other m - 1 people).
B) Find the person i* who is most genetically similar to everyone else: $i^* = \arg\max_i \mathrm{Var}(i)$.
C) Assign initial ancestry estimates $\theta^{(i)} = (\theta_1^{(i)}, \theta_2^{(i)})$ based on how similar i is to i*:
- Assume i* has ancestry (2, 2)
- For each individual i, assign ancestry from the score S(i, i*):
  - Lowest $(1 - \alpha)^2 n$ scores S(i, i*): (1, 1)
  - Highest $\alpha^2 n$ scores S(i, i*): (2, 2)
  - Everyone else: (1, 2)
Here α is presumably the Pop 2 fraction α_2 and n the number of individuals (written m elsewhere in these slides).
Figure: axis of scores S(i, i*) from 0 to S(i*, i*), divided into (Pop 1, Pop 1), (Pop 1, Pop 2), (Pop 2, Pop 2).
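
Steps A to C can be sketched as follows, starting from a precomputed similarity matrix. The group sizes are read here as (1 - α_2)^2 m lowest scores and α_2^2 m highest scores, rounded to integers. That reading, the rounding, and forcing i* into the (2, 2) group even when the rounded count is 0 are assumptions. The function name `maxvar_init` and the toy matrix are hypothetical.

```python
import numpy as np

def maxvar_init(S, alpha2):
    """Initial ancestry pairs from a similarity matrix S (m x m)."""
    S = np.asarray(S, dtype=float)
    m = S.shape[0]
    off_diag = S - np.diag(np.diag(S))
    var = (off_diag ** 2).sum(axis=1)          # Var(i) = sum_{i' != i} S(i, i')^2
    i_star = int(np.argmax(var))
    scores = S[:, i_star]
    order = np.argsort(scores, kind="stable")  # lowest score first
    n_low = int(round((1 - alpha2) ** 2 * m))
    n_high = int(round(alpha2 ** 2 * m))
    theta = [(1, 2)] * m
    for i in order[:n_low]:
        theta[i] = (1, 1)
    for i in order[m - n_high:]:
        theta[i] = (2, 2)
    theta[i_star] = (2, 2)                     # the slide assumes i* is (2, 2)
    return i_star, theta

# Hypothetical similarity matrix for 5 people; person 4 stands out.
S = np.array([[ 4,  1, 0, 0, -3],
              [ 1,  4, 0, 0, -2],
              [ 0,  0, 4, 0,  1],
              [ 0,  0, 0, 4,  1],
              [-3, -2, 1, 1,  5]])
i_star, theta = maxvar_init(S, alpha2=0.4)
```

These pairs only seed the iteration. The ICM updates of Equations (1) and (2) are what refine them.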
