Grouping preprocess for haplotype inference from SNP and CNV data



Journal of Physics: Conference Series

To cite this article: Hiroyuki Shindo et al 2009 J. Phys.: Conf. Ser.

Hiroyuki Shindo 1, Hiroshi Chigira 1, Tomoyo Nagaoka 1, Naoyuki Kamatani 2 and Masato Inoue 1

1 Department of Electrical Engineering and Bioscience, School of Advanced Science and Engineering, Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo, Japan
2 Institute of Rheumatology, Tokyo Women's Medical University, 10-22 Kawada-cho, Shinjuku-ku, Tokyo, Japan

masato.inoue@eb.waseda.ac.jp

Abstract. Statistical haplotype inference is an indispensable technique in the field of medical science. The authors previously reported Hardy-Weinberg equilibrium-based haplotype inference that could manage single nucleotide polymorphism (SNP) data. We recently extended the method to cover copy number variation (CNV) data. Haplotype inference from mixed data is important because SNPs and CNVs are occasionally in linkage disequilibrium. The idea underlying the proposed method is simple, but the algorithm for it needs to be quite elaborate to reduce the calculation cost. Consequently, we have focused on the details of the algorithm in this study. Although the main advantage of the method is accuracy, in that it does not use any approximation, its main disadvantage is still the calculation cost, which is sometimes intractable for large data sets with missing values.

1. Introduction

Genome-wide association studies have been revealing the relationships between genetic variations and phenotypic traits. Current high-throughput genotyping technologies have provided us with a number of single nucleotide polymorphisms (SNPs), which could be associated with common diseases such as type II diabetes or rheumatoid arthritis [1]. In addition to SNPs, it has been reported that structural variations such as copy-number variations (CNVs) affect phenotypic traits.
CNVs are genetic variations in the copy number of DNA segments in comparison with a reference genome, a copy unit of which ranges from 1 kilobase to several megabases. CNVs are caused by genetic deletion or duplication, and they cover more than 12% of the human genome [2]. Since SNPs and CNVs are in linkage disequilibrium [3, 4], we need to do association analyses of integrated SNP-CNV sequences. Haplotype information is crucial for association tests and other genetic studies because it is more powerful than information about the alleles of a single marker. (A haplotype is a single genetic constituent of an individual chromosome inherited from the father or mother.) Since experimental genotype data are observed as unphased pairs of alleles in SNPs and as the total number of segment copies over two chromosomes in CNVs, the haplotype phase is usually determined by statistical and computational methods. Although various methods of haplotype phasing have been proposed [5, 6, 7, 8, 9, 10, 11], they cannot be directly applied to SNP-CNV haplotype phasing. One of the most difficult

problems is that the number of possible haplotypes increases exponentially with the number of heterozygous loci. One of the standard methods used for haplotype inference, the EM algorithm [6], also suffers from huge calculation costs. In this paper, we propose a haplotype-phasing algorithm based on the grouping preprocess (GrEM) [12], which can be applied to SNP and CNV genotypes. The GrEM efficiently identifies haplotypes whose population frequency is exactly zero in the maximum likelihood solution, and thereby reduces the number of possible haplotypes without using any approximation. From the mathematical aspect, SNPs appear as logical additions of alleles on two chromosomes, while CNVs appear as arithmetical additions. The proposed method can handle mixed SNP and CNV data through an integrated framework, which utilizes the properties that SNPs and CNVs have in common. We tested our algorithm by applying it to real and artificial unphased genotypes and found that it relaxed limitations on the calculation cost.

2. Model

Here, we define the probabilistic model for the data. We assume that the observed data are unphased multilocus genotypes. For example, [A/T, A/C, G/G, 3] represents 4 polymorphic genotypes composed of 3 SNPs and 1 CNV. A/T denotes the heterozygous genotype, i.e., one of the alleles is A and the other is T. 3 denotes the total number of allelic copies over two chromosomes, i.e., the combinations of the numbers can be 0 and 3 or 1 and 2. Here, we assume that the number of alleles at each SNP locus is two; this is the ordinary situation. We make no assumptions about the copy number of each CNV site, so any non-negative integer is acceptable. For simplicity, we shall use A to express a major allele or the homozygous genotype of major alleles for each SNP locus. Similarly, C is used for minor allele(s).
Incidentally, as our method does not distinguish whether a certain allele is major or minor, A and C are completely interchangeable throughout this paper. B is used as shorthand for the heterozygous genotype A/C. With this notation, [A/T, A/C, G/G, 3] is expressed as [B, B, A, 3] or briefly as [BBA3].

First, we define θ as the haplotype frequencies of the population. For example, in the case of 3 SNPs and 1 CNV, where the maximum copy number of the CNV locus is 3, $\theta = (\theta_{[AAA0]}, \theta_{[AAA1]}, \ldots, \theta_{[CCC3]})$ is a 32-dimensional vector ($2^3$ SNP allele patterns times 4 copy numbers). As θ are frequencies, the elements are normalized to sum to 1 ($\sum_h \theta_h \equiv 1$), where h denotes every possible haplotype. The diplotype configuration for each individual is assumed to follow the Hardy-Weinberg equilibrium (HWE),

$$P(d_i \mid \theta) \equiv \begin{cases} \theta_{d_{i,1}}^2 & \text{if } d_{i,1} = d_{i,2}, \\ 2\,\theta_{d_{i,1}}\theta_{d_{i,2}} & \text{if } d_{i,1} \neq d_{i,2}, \end{cases} \tag{1}$$

where $d_i \equiv (d_{i,1}, d_{i,2})$ ($i = 1, 2, \ldots, I$) denotes the pair of haplotypes for the ith individual and $\theta_{d_{i,1}}$ denotes the haplotype frequency for haplotype $d_{i,1}$. To prevent permutation symmetry, we impose the constraint $d_{i,1} \le d_{i,2}$ on $d_i$ by defining a certain order on haplotypes. The individuals are assumed to be independent of one another. Also, measurement of the genotypes is assumed to include no errors:

$$P(\{g_i\}, \{d_i\} \mid \theta) \equiv \prod_{i}^{I} \delta_{d_i \in g_i}\, P(d_i \mid \theta), \tag{2}$$

where $g_i$ denotes the observed unphased genotype data for the ith individual, $\{g_i\}$ denotes the set of data for all individuals, I denotes the number of individuals in a data set, and δ denotes an indicator function, which yields 1 when the given condition is true and 0 otherwise. $d_i \in g_i$ denotes that a diplotype, $d_i$, is consistent with the genotype data, $g_i$.
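The model above can be made concrete with a small sketch. It enumerates the diplotypes consistent with a mixed SNP and CNV genotype and evaluates the HWE probability of Eq. (1). The function names (`locus_pairs`, `diplotypes`, `hwe_prob`) and the tuple encoding of haplotypes are assumptions made for illustration only; they do not come from the paper.

```python
from itertools import product


def locus_pairs(g):
    """Allele pairs on the two chromosomes that are consistent with one locus.

    SNP codes: 'A' and 'C' are homozygous, 'B' is heterozygous.
    An int is a CNV genotype, i.e. the total copy number over two chromosomes.
    """
    if g == 'A':
        return [('A', 'A')]
    if g == 'C':
        return [('C', 'C')]
    if g == 'B':
        return [('A', 'C'), ('C', 'A')]
    return [(k, g - k) for k in range(g + 1)]  # CNV: g = k + (g - k)


def diplotypes(genotype):
    """All unordered haplotype pairs (d1 <= d2) consistent with a genotype."""
    found = set()
    for combo in product(*(locus_pairs(g) for g in genotype)):
        h1 = tuple(pair[0] for pair in combo)
        h2 = tuple(pair[1] for pair in combo)
        # Sorting imposes the order d1 <= d2 that removes permutation symmetry.
        found.add(tuple(sorted((h1, h2))))
    return sorted(found)


def hwe_prob(d, theta):
    """Eq. (1): theta^2 for a homozygous pair, 2 * theta1 * theta2 otherwise."""
    h1, h2 = d
    p1, p2 = theta.get(h1, 0.0), theta.get(h2, 0.0)
    return p1 * p1 if h1 == h2 else 2.0 * p1 * p2


if __name__ == "__main__":
    for d in diplotypes(('B', 'B', 'A', 3)):   # the genotype [BBA3]
        print(d)
```

The enumeration is exponential in the number of heterozygous loci, which is exactly the cost that the grouping preprocess is meant to avoid.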

Our final target is to determine the optimal diplotype configurations, $\{\hat{d}_i\}$, such as $\arg\max_{\{d_i\}} P(\{d_i\} \mid \{g_i\})$ by assuming a certain prior distribution, $P(\theta)$, but this discrete optimization problem is usually hard to solve. Instead, we take a two-step approach. We first determine the maximum likelihood estimator of θ:

$$\hat{\theta} \equiv \arg\max_{\theta} P(\{g_i\} \mid \theta). \tag{3}$$

Then, we infer

$$\{\hat{d}_i\} \equiv \arg\max_{\{d_i\}} P(\{d_i\} \mid \{g_i\}, \hat{\theta}) \tag{4}$$

using the determined $\hat{\theta}$. Equation (4) is easy to solve, but Eq. (3) is not. The EM algorithm is an effective approach to solving this problem; it is an iterative procedure that estimates the maximum likelihood estimator by starting from a randomly set $\theta^{(0)}$ and sequentially calculating

$$\theta^{(t+1)} \equiv \arg\max_{\theta} \sum_{\{d_i\}} P(\{d_i\} \mid \{g_i\}, \theta^{(t)}) \ln P(\{g_i\}, \{d_i\} \mid \theta) \tag{5}$$

under the expectation that $\theta^{(t)}$ will converge to the maximum likelihood estimator. The EM algorithm guarantees convergence to a locally optimal solution, which is not necessarily globally optimal. Therefore, several trials with different initial values are carried out, and the best solution obtained from these is employed. In this haplotype-inference problem, Eq. (5) can be reduced to

$$\theta_h^{(t+1)} \propto \sum_{i}^{I} \sum_{d_i \in g_i} \theta^{(t)}_{d_{i,1}} \theta^{(t)}_{d_{i,2}} \left(\delta_{d_{i,1} = h} + \delta_{d_{i,2} = h}\right). \tag{6}$$

Thus, the calculation cost is roughly on the order of the total number of possible diplotype configurations satisfying $d_i \in g_i$ over all individuals. This calculation cost occasionally becomes intractable because it increases exponentially with the number of heterozygous loci. For example, if the genotype data of a certain individual are [BB...B] with 50 SNP loci, the calculation cost is exponential in the number of loci N, i.e., $O(2^N)$ with N = 50.

3. Theory

The idea behind the GrEM we previously proposed is simple. Our direct task here is to find one of the optimal solutions, $\hat{\theta}$, which maximizes the following likelihood function under given genotype data $\{g_i\}$,

$$P(\{g_i\} \mid \theta) = \prod_{i}^{I} \sum_{d_i \in g_i} P(d_i \mid \theta). \tag{7}$$

By focusing on the symmetrical structure of this function, the GrEM reduces the calculation cost.
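As an illustration of the EM iteration, the sketch below runs a textbook EM for haplotype frequencies under HWE. It is a hedged example rather than the authors' implementation: the function name `em_haplotype_frequencies`, the uniform starting point (the text starts from random values and uses several restarts), and the explicit per-individual normalization of the posterior weights in the E-step are all assumptions of this sketch.

```python
def em_haplotype_frequencies(consistent, n_iter=200):
    """Estimate haplotype frequencies with a plain EM under HWE.

    consistent[i] lists the haplotype pairs (h1, h2) that are consistent
    with the genotype of individual i (the condition d_i in g_i).
    """
    haps = sorted({h for pairs in consistent for d in pairs for h in d})
    # Assumption: uniform start for reproducibility instead of a random one.
    theta = {h: 1.0 / len(haps) for h in haps}
    for _ in range(n_iter):
        counts = dict.fromkeys(haps, 0.0)
        for pairs in consistent:
            # E-step: HWE weight of each diplotype, Eq. (1) ...
            w = [theta[a] ** 2 if a == b else 2.0 * theta[a] * theta[b]
                 for a, b in pairs]
            z = sum(w)
            if z == 0.0:
                continue
            # ... normalized within the individual to give a posterior.
            for (a, b), wi in zip(pairs, w):
                counts[a] += wi / z
                counts[b] += wi / z
        # M-step: expected haplotype counts renormalized to frequencies.
        total = sum(counts.values())
        theta = {h: c / total for h, c in counts.items()}
    return theta


if __name__ == "__main__":
    data = [
        [("AA", "AA")],                  # genotype [AA]
        [("AA", "CC"), ("AC", "CA")],    # genotype [BB]
    ]
    print(em_haplotype_frequencies(data))
```

Each iteration visits every consistent diplotype of every individual, so the running time is proportional to the number of diplotypes, which is the quantity the GrEM aims to shrink.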
Here, we show several simple examples composed of only SNP loci, but this theory is also valid for CNV-only data and for mixed SNP and CNV data. For the first example, let us assume one individual has [ABBB], as shown in figure 1(a). In this case, the likelihood function is given as

$$P(\{g_i\} \mid \theta) = \left(\theta_{[AAAA]}\theta_{[ACCC]} + \theta_{[AAAC]}\theta_{[ACCA]} + \theta_{[AACA]}\theta_{[ACAC]} + \theta_{[AACC]}\theta_{[ACAA]}\right). \tag{8}$$

There are four haplotype pairs in this equation, and obviously, assigning non-zero values to one of them and zero values to the others produces the optimal solution. Actually,

$\theta_{[AAAA]} = \theta_{[ACCC]} = 1/2$ is one of the optimal solutions (note that the other θs are zero because $\sum_h \theta_h \equiv 1$). In other words, these four haplotype pairs are equivalent to one another because of the symmetry, and only one pair has non-zero probabilities that satisfy the optimality of the likelihood function. We can thus set the probabilities of three arbitrary haplotype pairs to zero while running the EM algorithm. If there are numerous heterozygous loci, this type of manipulation greatly reduces the calculation cost.

For the second example, let us assume two individuals have [ABBB] and [BAAB], as shown in figure 1(b). In this case, the likelihood function is given as

$$P(\{g_i\} \mid \theta) = \left(\theta_{[AAAA]}\theta_{[ACCC]} + \theta_{[AAAC]}\theta_{[ACCA]} + \theta_{[AACA]}\theta_{[ACAC]} + \theta_{[AACC]}\theta_{[ACAA]}\right) \left(\theta_{[AAAA]}\theta_{[CAAC]} + \theta_{[AAAC]}\theta_{[CAAA]}\right). \tag{9}$$

As a result, there are two optimal solutions: the first is $\theta_{[AAAA]} = 1/2$, $\theta_{[ACCC]} = \theta_{[CAAC]} = 1/4$, and the second is $\theta_{[AAAC]} = 1/2$, $\theta_{[ACCA]} = \theta_{[CAAA]} = 1/4$. As we can easily guess, [AAAA] and [AAAC] have advantages because they are included in both $g_1$ and $g_2$. [ACCC], [ACCA], [CAAC], and [CAAA] also have advantages because each of them can be paired with an advantageous one. As the remaining four haplotypes, [ACAC], [ACAA], [AACC], and [AACA], are less advantageous, we can set their probabilities to zero. This type of manipulation also greatly reduces the calculation cost. Similar to the first example, the two haplotype combinations {[AAAA], [ACCC], [CAAC]} and {[AAAC], [ACCA], [CAAA]} are symmetrical and equivalent to each other. Thus, we can set the probabilities of one of them to zero.

Such optimality can easily be demonstrated through simple equations. Assume θ such that every θ in Eq. (9) is greater than zero. Also, suppose θ′ such that

$$\begin{aligned} \theta'_{[AAAA]} &= \theta_{[AAAA]} + \theta_{[AAAC]} + \theta_{[AACA]} + \theta_{[AACC]}, \\ \theta'_{[ACCC]} &= \theta_{[ACCC]} + \theta_{[ACCA]} + \theta_{[ACAC]} + \theta_{[ACAA]}, \\ \theta'_{[CAAC]} &= \theta_{[CAAC]} + \theta_{[CAAA]}, \\ \theta'_{[AAAC]} &= \theta'_{[AACA]} = \theta'_{[AACC]} = \theta'_{[ACCA]} = \theta'_{[ACAC]} = \theta'_{[ACAA]} = \theta'_{[CAAA]} = 0. \end{aligned} \tag{10}$$

Then, as the likelihood of θ is always less than that of θ′,

$$P(\{g_i\} \mid \theta) < P(\{g_i\} \mid \theta'), \tag{11}$$

such θ is necessarily suboptimal. Similarly, we can prove that θ is suboptimal except for the two cases in which only $\theta_{[AAAA]}$, $\theta_{[ACCC]}$, and $\theta_{[CAAC]}$ have positive values or only $\theta_{[AAAC]}$, $\theta_{[ACCA]}$, and $\theta_{[CAAA]}$ have positive values. Therefore, we can assert that the remaining θs should be set to exactly zero before running the EM algorithm.

4. Algorithms

Next, how can we perform such manipulations? Doing so manually is not realistic because there are usually hundreds of individuals. Our approach consists of three steps.

(1) The first step determines all borders that correspond to ellipses in the Venn diagram of haplotypes in figure 1. Because a haplotype paired to some advantageous haplotype is also advantageous, not only the borders corresponding to individuals but also new borders are necessary to distinguish these advantageous haplotypes. Intuitively, a new border is a mirror image of a shared haplotype set reflected by some individual. According to our experience, the number of borders usually greatly exceeds that of individuals.

(2) The second step locally determines the most nested territories, which correspond to intersections of arbitrary borders and cannot be separated further by any border. We call such haplotype sets (territories) leaves. Figure 2 shows examples of leaves. The number


of leaves is usually much smaller than $2^N$ when N is large.

(3) The third step detects loopy graphs and replaces some leaves with haplotypes if necessary. Briefly, the leaves and individuals can be expressed using a graph (figure 3). Here, graph vertices and edges correspond to leaves and individuals, respectively. The graph is usually disconnected, forming several separate subgraphs. For each subgraph, we check whether it has any odd-step loops such as Möbius strips, and if it has, each vertex (leaf) is replaced with some haplotypes, which are also called leaves. Empirically, such replacements rarely occur but are necessary for an exact maximum likelihood estimator.

Figure 1. Venn diagrams of haplotypes for the (a) first, (b) second, and (c) third examples. $g_1$: [ABBB] denotes multilocus genotype data from the 1st individual. A denotes major and C denotes minor alleles, while B denotes a heterozygous genotype. [ACAC], [ACAA], ... denote haplotypes. The bidirectional arrows denote diplotype configurations for $g_i$. Each ellipse labeled $g_i$ includes all haplotypes this individual may have.

Figure 2. Venn diagrams of leaves after Step 2 of GrEM for the (a) first, (b) second, and (c) third examples. Borders determined by Step 1 are (a) {[ABBB]}, (b) {[ABBB], [BAAB], [ACCB], [CAAB]}, and (c) {[ABBB], [BAAB], [BBBB], [BCCB], [CBBB]}. For (b), [AAAB] is temporarily created as a candidate for a border but is dropped because it is an intersection of other borders, i.e., [AAAB] = [ABBB] ∩ [BAAB]. The same applies to [ACCB] and [CAAB] for (c).

GrEM is mainly composed of these three steps of manipulation. After these steps, we run the EM algorithm not for all possible haplotypes (such as those in figure 1) but for leaves (such as those in figure 4). Although the number of all possible haplotypes may not be tractable, we expect that the number of leaves will be much


smaller and tractable. More precisely, the calculation cost of the EM algorithm is in proportion to the number of possible diplotypes, as shown in Eq. (6); we evaluate the proposed method by comparing the numbers of possible diplotypes before and after these steps.

Figure 3. Graphical expression of leaves after Step 2 of GrEM for the (a) first, (b) second, and (c) third examples. [ABBB], [ACCB], ... denote vertices of leaves. $g_i$ denotes the edge of the ith individual. In each example, there happens to be only one subgraph, but there are usually several disconnected subgraphs for larger data sets. For (a), there is a 1-step loop, [ABBB]-$g_1$-[ABBB], which is expanded with Step 3 of GrEM. For (b), there are no loops. For (c), there is a 3-step loop, [ACCB]-$g_1$-[AAAB]-$g_2$-[CAAB]-$g_3$-[ACCB], which is also expanded with Step 3.

Figure 4. Graphical expression of leaves after Step 3 of GrEM for the (a) first, (b) second, and (c) third examples. For (a) and (c), the original leaves are replaced by new leaves with the loop expansion in Step 3, while loop expansion did not occur in (b). For the graph in figure 3(c), we can start from [ACCB], go around through [AAAB] and [CAAB], and return to [ACCB]. However, if we check the haplotype connectivity more closely, this path is in fact impossible. As shown in (c), we start from [ACCA], go around through [AAAC] and [CAAA], and cannot return to [ACCA]. We have to go further through [ACCC], [AAAA], and [CAAC] to reach the starting haplotype. We can find such a Möbius strip and expand it with Step 3 of GrEM. The calculation cost of the following EM algorithm is roughly in proportion to the number of diplotypes, which are indicated by the bidirectional arrows here.

Before going into details on the algorithms for the GrEM, let us define some operators, as listed in table 1. Again, h denotes a single haplotype, e.g., [AACC].
g denotes multilocus genotype data from an individual, and we simply call g an individual; e.g., [ABCB] means this individual may have [AACA] and [ACCC] or [AACC] and [ACCA]. b denotes a border; e.g., [ABCB] means the set of haplotypes {[AACA], [AACC], [ACCA], [ACCC]}. The set of haplotypes a border can express is limited to the direct product of the genotypes at each locus.
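The direct-product structure of a border can be sketched in a few lines. The helper names `expand_border` and `border_size` are hypothetical; the point is that a border is stored as one symbol per locus, while the haplotype set it denotes can be exponentially larger.

```python
from itertools import product

# Alleles allowed at one SNP locus of a border: B stands for "A or C".
SNP_ALLELES = {'A': 'A', 'C': 'C', 'B': 'AC'}


def expand_border(border):
    """List every haplotype contained in a SNP border such as 'ABCB'."""
    return [''.join(h) for h in product(*(SNP_ALLELES[s] for s in border))]


def border_size(border):
    """Number of haplotypes in the border without enumerating them."""
    return 2 ** border.count('B')


if __name__ == "__main__":
    print(expand_border('ABCB'))   # the four haplotypes named in the text
    print(border_size('B' * 50))   # far too many to enumerate
```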

Table 1. Operation tables of ⊕, ⟨·⟩, ∩, and ⊖ for SNP data. b denotes a border, and g denotes genotype data from a certain individual. All of these operations are performed by locus-wise processes according to the tables below. Superscript (n) denotes the identity of a locus. A denotes major and C denotes minor alleles, while B denotes a heterozygous genotype and Z denotes a missing datum of a genotype; this may be A, C, or B. A dash (-) denotes that there is no result.

g := h ⊕ h′
  h(n) \ h′(n)    A    C
  A               A    B
  C               B    C

b := ⟨g⟩
  g(n)    A    C    B    Z
  b(n)    A    C    B    B

b″ := b ∩ b′
  b′(n) \ b(n)    A    C    B
  A               A    -    A
  C               -    C    C
  B               A    C    B

b′ := g ⊖ b
  g(n) \ b(n)     A    C    B
  A               A    -    A
  C               -    C    C
  B               C    A    B
  Z               B    B    B

Algorithm 1 DetermineBorders(G)
1: initialize B := φ, C := φ, and D := φ
2: foreach g ∈ G, calculate b := ⟨g⟩, and add b to C and D
3: foreach g ∈ G, if ⟨g⟩ has homozygous b then add b to C and D
4: if C = φ then return B
5: remove largest b from C
6: if b = b′ ∩ b″ ∩ b‴ ... (b′, b″, b‴, ... ∈ B) then goto 4
7: foreach g ∈ G and ⟨g⟩ ∩ b ≠ φ, calculate b′ := g ⊖ b, if b′ ∉ D then add b′ to C and D
8: add b to B
9: goto 4

We also define operation tables for these items as shown in table 1. ⊕ is a binary operator that produces the observed result, g, from two given haplotypes, e.g., [AACC] ⊕ [ACAC] is [ABBC]. ⟨·⟩ is a unary operator that produces a border from a given g, e.g., ⟨[ABCA]⟩ is [ABCA]. The result is the same as the operand for SNP data, but differs for missing values of SNP data and for CNV data. We will explain the CNV case later. ∩ is a binary operator that produces the intersection of two given borders, e.g., [ABBC] ∩ [BABB] is [AABC]. ⊖ is a binary operator that produces the set of haplotypes that are paired with a given b for g, e.g., [ABBB] ⊖ [AAAB] is [ACCB]. Note that if b includes haplotypes that cannot be paired for g when g ⊖ b is calculated, such haplotypes are simply neglected, e.g., [ABBB] ⊖ [BAAB] is also [ACCB] (refer to figure 2(b)).

The first step of the GrEM is as follows, and its specific algorithm is algorithm 1. This is a function whose argument is G, the set of all individuals, and it returns B, the set of determined borders.
This function intuitively produces mirror images sequentially. φ denotes the empty set. C denotes the set of candidates for borders that are worth keeping. D denotes the set of processed borders, which prevents a border from being added to C more than once on line 7. On line 2, ⟨g⟩ denotes the ellipse in the Venn diagram corresponding to each individual $g_i$. Homozygous b on line 3 will be explained later in connection with CNV data. The largest b on line 5 denotes the border with the largest number of Bs in its sequence, e.g., the first is the largest of [ABBB], [ABCB], and [AACB], because the first has three Bs while the others have fewer than three. Line 6 is a cost saver. This means that if a new candidate coincides with any intersection of an arbitrary number of already determined borders, such a candidate is not worth keeping. Line 7 represents the main operation of this function. A mirror image of a border for each individual is added to C. These manipulations are repeated until C has no elements.
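The first step can be sketched for SNP data as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names (`UNARY`, `CAP`, `MIRROR`, `determine_borders`, and the helpers) are hypothetical, borders are plain strings, line 3 (homozygous b) is omitted because it adds nothing for SNP data without missing values, ties on line 5 are broken alphabetically, and the condition on line 7 is taken to mean that the individual shares at least one haplotype with b.

```python
from functools import reduce

# Locus-wise tables transcribed from table 1; a missing key means "no result".
UNARY = {'A': 'A', 'C': 'C', 'B': 'B', 'Z': 'B'}
CAP = {('A', 'A'): 'A', ('A', 'B'): 'A', ('B', 'A'): 'A',
       ('C', 'C'): 'C', ('C', 'B'): 'C', ('B', 'C'): 'C', ('B', 'B'): 'B'}
MIRROR = {('A', 'A'): 'A', ('A', 'B'): 'A',            # keys are (g, b)
          ('C', 'C'): 'C', ('C', 'B'): 'C',
          ('B', 'A'): 'C', ('B', 'C'): 'A', ('B', 'B'): 'B',
          ('Z', 'A'): 'B', ('Z', 'C'): 'B', ('Z', 'B'): 'B'}


def border_of(g):
    """Border of all haplotypes that individual g may have."""
    return ''.join(UNARY[x] for x in g)


def locuswise(table, left, right):
    """Apply a binary table locus by locus; None if any locus has no result."""
    loci = [table.get(pair) for pair in zip(left, right)]
    return None if None in loci else ''.join(loci)


def intersect(b1, b2):
    return locuswise(CAP, b1, b2)


def mirror(g, b):
    return locuswise(MIRROR, g, b)


def is_intersection_of(b, borders):
    """Line 6 (cost saver): does b equal an intersection of known borders?"""
    supersets = [x for x in borders if intersect(x, b) == b]
    return bool(supersets) and reduce(intersect, supersets) == b


def determine_borders(G):
    B, C, D = [], set(), set()
    for g in G:                                        # line 2
        b = border_of(g)
        C.add(b)
        D.add(b)
    while C:                                           # line 4
        b = max(C, key=lambda x: (x.count('B'), x))    # line 5: largest b
        C.remove(b)
        if is_intersection_of(b, B):                   # line 6
            continue
        for g in G:                                    # line 7: mirror images
            b2 = mirror(g, b)
            if b2 is not None and b2 not in D:
                C.add(b2)
                D.add(b2)
        B.append(b)                                    # line 8
    return B


if __name__ == "__main__":
    print(determine_borders(['ABBB', 'BAAB']))
```

Under these assumptions, the sketch returns the border sets listed in the caption of figure 2 for the three examples, with the third individual taken to be [BBBB].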

Algorithm 2 DetermineLeaves(B)
1: initialize L := φ and B′ := φ
2: if B = φ then return L
3: remove largest b from B
4: foreach b′ ∈ B, if b ⊇ b′ then remove b′ from B, and add b′ to B′
5: foreach b′ ∈ B and b ∩ b′ ≠ φ, calculate b″ := b ∩ b′, and add b″ to B′
6: if B′ = φ then add b to L else add (call DetermineLeaves(B′)) to L
7: goto 2

Figure 5. Example of a tree of borders. A tree is usually more powerful than a set when we check whether a certain border is already included or not. If we introduce a certain order for siblings in a tree, it will be more efficient.

The second step of the GrEM is as follows, and its specific algorithm is algorithm 2. This is a recursive function whose argument is B, the result of the first step, and it returns L, the set of determined leaves. Leaves are defined using the set of borders. A leaf must coincide with either some border or the intersection of an arbitrary number of borders, and it must not be separated by any borders. Recall that shared haplotypes and their mirror images have advantages. Leaves coincide with such advantageous haplotype sets. If line 6 in algorithm 1, the cost saver, is removed, the resulting B necessarily includes all leaves because applying the mirror operator ⊖ twice is equivalent to the simple intersection, g ⊖ (g ⊖ b) = ⟨g⟩ ∩ b (except for missing values). The approach without the cost saver seems more straightforward, but it usually suffers from numerous borders resulting in memory overflow. On line 3, one of the borders is selected, and the intersection is only calculated inside this border. Such a step is hierarchically repeated on line 6. This approach suppresses the peak number of borders and results in a faster calculation. When all the intersections are checked, this function returns the set of all determined leaves.

These two algorithms can be made more efficient if we introduce a tree of borders as shown in figure 5. The construction of this tree is intuitive.
Borders are added to the tree in descending order of their cardinalities. We start from the root and check all its children to find whether each child is a proper superset of a new border or not. If there is such a child, we move to the child and check all its children again. We repeat this check until there are no such children. Then, we add the new border to the current border as a child. We also construct such a tree for {⟨$g_i$⟩}. This is because these two algorithms contain numerous inclusion checks, e.g., the check of ⟨g⟩ against b on line 7 of algorithm 1, or b ⊇ b′ on line 4 of algorithm 2. Also, the tree of borders is efficient for checking the if-clause on line 6 of algorithm 1.

The third step of the GrEM is as follows. First, we group all leaves into several disconnected subgraphs. Each subgraph is checked as to whether it has any odd-step loops or not. More specifically, we arbitrarily select a leaf, l, from the subgraph and assign a specific haplotype, h, to l, where h ∈ l. Then, we propagate h to the whole subgraph; we find the adjacent vertex l′ of l by following edge g, where l′ = g ⊖ l, and assign h′ := g ⊖ h to l′. If l′ has already been assigned the same haplotype, we do not assign it twice, whereas when h′ is different from the assigned haplotype, we assign h′ to l′ in addition to the already assigned one(s). This assignment is repeated throughout the whole subgraph. Finally, we check whether every leaf in the subgraph is assigned only one haplotype or not. If so, we do not have to do anything else. If not, we replace all leaves in the subgraph with the assigned haplotypes, which become new leaves instead.
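The propagation described above can be sketched for SNP leaves as follows. The function name `expand_odd_loops`, the string encoding of leaves, and the choice of the seed haplotype (every B replaced by A) are assumptions of this sketch; only edges whose mirror image is one of the supplied leaves are followed.

```python
from collections import deque

# Mirror table g (-) b from table 1, keyed by (g, b); missing key: no result.
MIRROR = {('A', 'A'): 'A', ('A', 'B'): 'A',
          ('C', 'C'): 'C', ('C', 'B'): 'C',
          ('B', 'A'): 'C', ('B', 'C'): 'A', ('B', 'B'): 'B',
          ('Z', 'A'): 'B', ('Z', 'C'): 'B', ('Z', 'B'): 'B'}


def mirror(g, b):
    loci = [MIRROR.get(pair) for pair in zip(g, b)]
    return None if None in loci else ''.join(loci)


def expand_odd_loops(leaves, G):
    """Step 3: replace leaves by haplotypes in subgraphs with odd-step loops."""
    leaf_set, visited, result = set(leaves), set(), []
    for start in leaves:
        if start in visited:
            continue
        seed = start.replace('B', 'A')          # an arbitrary haplotype in the leaf
        assigned = {start: {seed}}
        queue = deque([(start, seed)])
        while queue:                            # propagate over the subgraph
            leaf, hap = queue.popleft()
            for g in G:
                leaf2 = mirror(g, leaf)         # adjacent vertex via edge g
                hap2 = mirror(g, hap)           # haplotype paired with hap
                if leaf2 not in leaf_set or hap2 is None:
                    continue
                if hap2 not in assigned.setdefault(leaf2, set()):
                    assigned[leaf2].add(hap2)
                    queue.append((leaf2, hap2))
        visited.update(assigned)
        if all(len(haps) == 1 for haps in assigned.values()):
            result.extend(sorted(assigned))     # no odd loop: keep the leaves
        else:                                   # odd loop: haplotypes become leaves
            result.extend(sorted(h for haps in assigned.values() for h in haps))
    return result


if __name__ == "__main__":
    # The 3-step loop of figure 3(c): [ACCB]-g1-[AAAB]-g2-[CAAB]-g3-[ACCB].
    print(expand_odd_loops(['ACCB', 'AAAB', 'CAAB'], ['ABBB', 'BAAB', 'BBBB']))
```

For the loop of figure 3(c), the propagation starting from [ACCA] visits [AAAC], [CAAA], [ACCC], [AAAA], and [CAAC] before returning, so every leaf ends up with two haplotypes and the subgraph is expanded, in line with the caption of figure 4.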

This algorithm successfully detects odd-step loops in the subgraph. Intuitively, this step detects Möbius strips and expands them. Three examples before this step are shown in figure 3, and those after this step are shown in figure 4. Through Steps 1 and 2, we obtain advantageous sets of haplotypes, but their connectivities sometimes form Möbius strips. Note that if all the cardinalities of leaves are one in a subgraph, this replacement never occurs. Empirically, this replacement rarely occurs. Also, this step does not take a lot of time.

We will only give a short explanation here of missing values of SNP data. The missing value denoted by Z may be A, C, or B. The proposed approach successfully manages missing values. More specifically, missing values are dealt with in the same manner as the other genotypes, as listed in table 1. Empirically, missing values greatly increase the number of borders and leaves. Thus, the number of missing values is critical to the calculation cost of the proposed method.

Table 2. Operation tables of ⊕, ⟨·⟩, ∩, and ⊖ for CNV data. b denotes a border and g denotes genotype data from a certain individual. All these operations are performed by locus-wise processes according to the tables below. Superscript (n) denotes the identity of a locus. A border has two values, i.e., minimum and maximum numbers of copies, for each locus, which is expressed as 0..x when the minimum and maximum numbers correspond to 0 and x.

g := h ⊕ h′:     h(n) = x, h′(n) = y        gives  g(n) = x + y
b := ⟨g⟩:        g(n) = x                   gives  b(n) = 0..x
b″ := b ∩ b′:    b(n) = x..y, b′(n) = w..z  gives  b″(n) = max(w, x)..min(z, y)
b′ := g ⊖ b:     b(n) = x..y, g(n) = z      gives  b′(n) = max(0, z − y)..max(0, z − x)

Each copy number for each polymorphic locus of a certain individual appears as an arithmetical addition of two copy numbers from each haplotype. In our approach, SNP and CNV have two common properties. These properties are related to the introduced operators, ⊕, ⟨·⟩, ∩, and ⊖. Operator ⟨·⟩ produces a set of haplotypes that a given individual may have.
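The CNV operation tables in table 2 can be sketched with intervals stored as (minimum, maximum) pairs. The function names are hypothetical, and the mirror operation follows the table literally, including the clamping at zero.

```python
def add_cnv(h1, h2):
    """Observed genotype from two CNV haplotypes: copy numbers add up."""
    return [x + y for x, y in zip(h1, h2)]


def border_of_cnv(g):
    """Border of an individual: each haplotype carries 0..x of the x copies."""
    return [(0, x) for x in g]


def intersect_cnv(b1, b2):
    """Locus-wise interval intersection; None if some locus is empty."""
    out = []
    for (x, y), (w, z) in zip(b1, b2):
        lo, hi = max(w, x), min(z, y)
        if lo > hi:
            return None
        out.append((lo, hi))
    return out


def mirror_cnv(g, b):
    """Copy numbers that pair with border b to give genotype g (table 2)."""
    return [(max(0, z - y), max(0, z - x)) for z, (x, y) in zip(g, b)]


if __name__ == "__main__":
    print(border_of_cnv([0, 1, 2, 3]))
    print(mirror_cnv([1, 2, 3, 4], [(0, 2), (1, 2), (0, 1), (2, 3)]))
```

Each border needs only two integers per locus, so every operation runs in time linear in the number of loci.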
We can define this operator for CNV as listed in table 2, e.g., ⟨[0, 1, 2, 3]⟩ is [0..0, 0..1, 0..2, 0..3], where x..y denotes the set of integers from x through y. Similar to SNP data, the set of haplotypes corresponding to a certain individual can also be expressed as the direct product of each locus. This operation can also be calculated with a locus-wise process. Although these properties are trivial, they are important for the proposed method. Operator ⊖ produces a set of haplotypes that can be paired with shared haplotypes. We can also define this operator for CNV as seen in table 2, e.g., [1, 2, 3, 4] ⊖ [0..2, 1..2, 0..1, 2..3] is [0..1, 0..1, 2..3, 1..2]. An operation result can also be expressed as a direct product of each locus, and this operation can be performed by using a locus-wise process. Operator ∩ produces new borders that coincide with the intersection of two given borders. This operator also retains these properties. The property of the direct product guarantees that a border can be stored in O(N) memory, where N denotes the number of polymorphic loci. The property of the locus-wise process guarantees that each operation can be performed with a calculation cost of O(N) if a locus-wise process requires a calculation cost of O(1). Without these properties, we would incur a memory requirement and calculation cost of $O(2^N)$, which is intractable. These properties also make haplotype inference from mixed SNP and CNV data possible. For example, we can manage such individual data as [A, B, C, 1, 2, 3]. For CNV, the set of copy numbers for each locus of a border is convex, and we can express it in the form of x..y by using two integers, i.e., minimum and maximum numbers. Although this property is not crucial, it is advantageous. Without this convexity, we would have to hold a set


of copy numbers for each locus, e.g., [{0, 1, 3}, {2}, {2, 3}, {0, 2, 4}]. If the average cardinality of each set were five, the memory requirement and calculation cost would increase fivefold.

Figure 6. Example of CNV data. (a) Venn diagram of haplotypes. (b) Venn diagram of leaves. (c) Graphical expression of leaves after Step 2. (d) Graphical expression of leaves after Step 3.

Figure 7. EM and GrEM results for the number of diplotypes. Each closed circle denotes the result from an artificial data set: 100 individuals of 1 to 30 SNP loci including 1% missing values. GrEM successfully reduced the number of diplotypes. EM could not handle 7 out of 150 data sets because of memory overflow. Such data sets are plotted on the extreme right of the graph.

One significant difference in the properties of SNP and CNV concerns the set of haplotypes a certain individual may have, ⟨g⟩; each haplotype in the set is symmetrical for SNP (figure 1(a)) but not for CNV. Let us consider an individual of g = [2, 2]. ⟨g⟩ is {[0,0], [0,1], [0,2], [1,0], [1,1], [1,2], [2,0], [2,1], [2,2]}. Of the elements in this set, only [1,1] can be a homozygous haplotype (figure 6(a)). This type of homozygous haplotype exists if g does not contain any heterozygous genotype B or odd copy numbers, e.g., [A, C, 2, 4]. The expression "⟨g⟩ has homozygous b" on line 3 of algorithm 1 refers to this situation. Except for this, CNV data or mixed SNP and CNV data can be dealt with by using the same algorithms. Figure 6 shows one such example.

5. Results

We first show how much GrEM reduces the calculation cost by using artificial data sets. Figure 7 compares the numbers of diplotypes, which are roughly in proportion to the calculation cost of the EM algorithm. Memory overflow often occurred with the original EM when there were more than 28 SNP loci. However, GrEM successfully reduced the number of diplotypes when there were more than 12 SNP loci.
The effect of the cost saver on line 6 in algorithm 1 is plotted in figure 8. The cost saver reduced about 90% of the borders (figure 8(a)), which roughly corresponds to the calculation cost and memory requirement of Step 1. Also, fewer borders reduced the calculation time of


12 (a) (b) Figure 8. Effect of cost saver on line 6 in algorithm 1. The data sets are the same as those used in figure 7. The cost saver reduced about 90% of the borders andthushelpedtoprevent memory overflow. Table 3. Results for real data sets. Loci denotes the number of polymorphic loci, where 1 means 1 CNV locus, while 12 means 1 CNV and 11 SNP loci. Log-likelihood denotes max θ ln P ({ ˆd i } {g i }, θ). denotes intractable cases. dataset loci number of missing log-likelihood individuals rate % GrEM EM MOCSphaser MRGPRX1 (CEU) MRGPRX1 (YRI) CYP2D6 (CEU) CYP2D6 (YRI) MRGPRX1 (CEU)+SNP MRGPRX1 (YRI)+SNP CYP2D6 (CEU)+SNP CYP2D6 (YRI)+SNP Step 2, as can be seen from figure 8(b). This effect was especially clear when there were many borders. The accuracy of GrEM was tested and validated using real data sets provided by Kato et al. [13]. They were measured by quantitative polymerase chain reaction (PCR) on CYP2D6 and MRGPRX1 genes for individuals of Northern and Western European descent from Utah, USA (CEU) and for individuals of the Yoruba people from Ibadan, Nigeria (YRI) in the HapMap populations. We also prepared mixed data for these experimental CNVs and neighboring SNPs. As shown in table 3, GrEM worked well on both small and large data sets. In regard to the small data sets, the log-likelihoods of the three methods were the same. In regard to the large data sets, memory overflow occurred in the original EM algorithm and the conventional mixture-of- CNV-SNP phaser (MOCSphaser) [13]. These results demonstrated that the proposed method could handle SNP-CNV data and greatly reduce the calculation cost. Concerning the calculation time, GrEM was much faster than MOCSphaser. We prepared 7 middle-sized data sets composed of one CNV locus and 9 14 SNP loci, for which both GrEM and MOCSphaser could perform the inference. GrEM was times faster than MOCSphaser, the median case was 2.5 sec and 28 sec, respectively, although calculation time greatly depends 11

13 on a given data. 6. Discussion The proposed method assumes Hardy-Weinberg equilibrium (HWE) for the generative model. Also, it does not employ recombination or single mutation for the model or the prior distribution. If the population of given data is large, this simple model is considered to be adequate, and it avoids overfitting to the given data. The EM algorithm is a standard method that is used to solve this model. The EM algorithm has serious problem in that it exponentially increases the calculation cost. The proposed method focuses on the symmetrical structure of the model and successfully reduces the calculation cost by using such symmetry. If the population is small or, more extremely, if pedigree data are given, the HWE model is not an adequate way of representing the data. One useful extension of the proposed method would be the ability to impose a blood relationship on the model. Missing values create realistic problems for haplotype inference. The proposed method can manage the missing values for SNP data. More precisely, completely missing values are acceptable, but partially missing values are not acceptable, e.g., for a certain SNP locus, such a case would be where one of the alleles is major and the other allele is unknown. This is because such data are asymmetric and GrEM does not work well with these. The proposed method has currently not been implemented to manage CNV data from ambiguously determined copy numbers. We would like to extend it to handle such data. Single-nucleotide variation in a copy unit (SNVC) is also a big problem with haplotype inference [14]. As CNV involves multiple copies of DNA segments longer than 1 kilobase, and the frequency of SNP is about 0.1%, the CNV region usually contains multiple SNP loci. These SNP data cannot be distinguished from one another but the quantity of each allele can be measured. We would also like to extend the method to SNVC data in the future. 
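The baseline that the Discussion refers to, plain EM estimation of haplotype frequencies under HWE, can be sketched as follows. This is a minimal illustration for SNP data without missing values, not GrEM and not the authors' code; the function names `snp_diplotypes` and `em_haplotype_frequencies`, the coding of genotypes as minor-allele counts, and the fixed iteration count are assumptions.

```python
from itertools import product
from collections import defaultdict

def snp_diplotypes(genotype):
    """Unordered haplotype pairs compatible with a SNP genotype (minor-allele counts 0/1/2)."""
    per_locus = [[(0, 0)] if g == 0 else [(1, 1)] if g == 2 else [(0, 1), (1, 0)]
                 for g in genotype]
    pairs = set()
    for splits in product(*per_locus):
        h1 = tuple(a for a, _ in splits)
        h2 = tuple(b for _, b in splits)
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def em_haplotype_frequencies(genotypes, n_iter=100):
    """Estimate haplotype frequencies theta by EM, assuming HWE."""
    candidates = [snp_diplotypes(g) for g in genotypes]
    haplotypes = sorted({h for c in candidates for d in c for h in d})
    theta = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start
    for _ in range(n_iter):
        counts = defaultdict(float)
        for cand in candidates:
            # E-step: under HWE, P(diplotype) = theta[h1] * theta[h2],
            # doubled when the two haplotypes differ.
            weights = [theta[h1] * theta[h2] * (1 if h1 == h2 else 2)
                       for h1, h2 in cand]
            norm = sum(weights)
            for (h1, h2), w in zip(cand, weights):
                counts[h1] += w / norm
                counts[h2] += w / norm
        # M-step: expected haplotype counts divided by the number of chromosomes.
        theta = {h: counts[h] / (2 * len(genotypes)) for h in haplotypes}
    return theta

# Three individuals, two SNP loci; the third is heterozygous at both loci.
theta = em_haplotype_frequencies([(0, 0), (2, 2), (1, 1)])
print({h: round(p, 3) for h, p in theta.items()})
```

In this toy data set the doubly heterozygous individual is ambiguous between the pairs {[0,0], [1,1]} and {[0,1], [1,0]}, and the two homozygous individuals pull the estimate toward the first pair. The list of candidate diplotypes per individual is what grows exponentially with the number of heterozygous loci in the plain algorithm.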
Acknowledgments

This work was partially supported by a Grant-in-Aid for Scientific Research on Priority Areas (No ) and by the High-Tech Research Center Project of the Japanese Ministry of Education, Culture, Sports, Science and Technology.

References

[1] The Wellcome Trust Case Control Consortium 2007 Nature
[2] Redon R et al 2006 Nature
[3] Locke D P et al 2006 Am. J. Hum. Genet.
[4] McCarroll S et al 2008 Hum. Mol. Genet. 17 R
[5] Clark A G 1990 Mol. Biol. Evol.
[6] Excoffier L and Slatkin M 1995 Mol. Biol. Evol.
[7] Stephens M, Smith N J and Donnelly P 2001 Am. J. Hum. Genet.
[8] Niu T H, Qin Z H S, Xu X P and Liu J S 2002 Am. J. Hum. Genet.
[9] Qin Z H S, Niu T H and Liu J S 2002 Am. J. Hum. Genet.
[10] Stephens M and Donnelly P 2003 Am. J. Hum. Genet.
[11] Xing E P, Jordan M I and Sharan R 2007 J. Comput. Biol.
[12] Shindo H, Chigira H, Tanaka J, Kamatani N and Inoue M 2008 J. Hum. Genet.
[13] Kato M, Nakamura Y and Tsunoda T 2008 Bioinformatics
[14] Kato M, Nakamura Y and Tsunoda T 2008 Am. J. Hum. Genet.
