Grouping preprocess for haplotype inference from SNP and CNV data

Journal of Physics: Conference Series

To cite this article: Hiroyuki Shindo et al 2009 J. Phys.: Conf. Ser.

Hiroyuki Shindo (1), Hiroshi Chigira (1), Tomoyo Nagaoka (1), Naoyuki Kamatani (2) and Masato Inoue (1)

(1) Department of Electrical Engineering and Bioscience, School of Advanced Science and Engineering, Waseda University, 3-4-1, Okubo, Shinjuku-ku, Tokyo, Japan
(2) Institute of Rheumatology, Tokyo Women's Medical University, 10-22, Kawada-cho, Shinjuku-ku, Tokyo, Japan

E-mail: masato.inoue@eb.waseda.ac.jp

Abstract. Statistical haplotype inference is an indispensable technique in the field of medical science. The authors previously reported a Hardy-Weinberg equilibrium-based haplotype inference method that could manage single nucleotide polymorphism (SNP) data. We recently extended the method to cover copy number variation (CNV) data. Haplotype inference from mixed data is important because SNPs and CNVs are occasionally in linkage disequilibrium. The idea underlying the proposed method is simple, but the algorithm for it needs to be quite elaborate to reduce the calculation cost. Consequently, we have focused on the details of the algorithm in this study. Although the main advantage of the method is accuracy, in that it does not use any approximation, its main disadvantage is still the calculation cost, which is sometimes intractable for large data sets with missing values.

1. Introduction

Genome-wide association studies have been revealing the relationships between genetic variations and phenotypic traits. Current high-throughput genotyping technologies have provided us with a number of single nucleotide polymorphisms (SNPs), which could be associated with common diseases such as type II diabetes or rheumatoid arthritis [1]. In addition to SNPs, it has been reported that structural variations such as copy-number variations (CNVs) affect phenotypic traits.
CNVs are genetic variations in the copy number of DNA segments in comparison with a reference genome; a copy unit ranges from 1 kilobase to several megabases. CNVs are caused by genetic deletion or duplication, and they cover more than 12% of the human genome [2]. Since SNPs and CNVs are in linkage disequilibrium [3, 4], we need to do association analyses of integrated SNP-CNV sequences. Haplotype information is crucial for association tests and other genetic studies because it is more powerful than information about the alleles of a single marker. (A haplotype is a single genetic constituent of an individual chromosome inherited from the father or mother.) Since experimental genotype data are observed as unphased pairs of alleles for SNPs and as the total number of segment copies over two chromosomes for CNVs, the haplotype phase is usually determined by statistical and computational methods. Although various methods of haplotype phasing have been proposed [5, 6, 7, 8, 9, 10, 11], they cannot be directly applied to SNP-CNV haplotype phasing. One of the most difficult

problems is that the number of possible haplotypes increases exponentially with the number of heterozygous loci. One of the standard methods used for haplotype inference, the EM algorithm [6], also suffers from huge calculation costs. In this paper, we propose a haplotype-phasing algorithm based on the grouping preprocess (GrEM) [12], which can be applied to SNP and CNV genotypes. The GrEM efficiently identifies haplotypes whose frequency in the population is exactly zero in the maximum likelihood solution, and thereby reduces the number of possible haplotypes without using any approximation. From the mathematical aspect, SNPs appear as logical additions of alleles on two chromosomes, while CNVs appear as arithmetical additions. The proposed method can handle mixed SNP and CNV data through an integrated framework, which utilizes the properties that SNP and CNV have in common. We tested our algorithm by applying it to real and artificial unphased genotypes and found that it relaxed limitations on the calculation cost.

2. Model

Here, we define the probabilistic model for the data. We assume that the observed data are unphased multilocus genotypes. For example, [A/T, A/C, G/G, 3] represents 4 polymorphic genotypes composed of 3 SNPs and 1 CNV. A/T denotes a heterozygous genotype, i.e., one of the alleles is A and the other is T. 3 denotes the total number of allelic copies over two chromosomes, i.e., the copy numbers on the two chromosomes can be 0 and 3 or 1 and 2. Here, we assume that the number of alleles at each SNP locus is two; this is the ordinary situation. As to the copy number of each CNV site, we make no assumptions, so any non-negative integer is acceptable. For simplicity, we shall use A to express a major allele or a homozygous genotype of major alleles for each SNP locus. Similarly, C is used for minor allele(s).
Incidentally, as our method does not distinguish whether a certain allele is major or minor, A and C are fully interchangeable throughout this paper. B is used as shorthand for the heterozygous genotype A/C. With these notations, [A/T, A/C, G/G, 3] is expressed as [B, B, A, 3] or briefly as [BBA3].

First, we define $\theta$ as the haplotype frequencies of the population. For example, in the 3 SNPs and 1 CNV case, where the maximum copy number of the CNV locus is 3, $\theta = (\theta_{[AAA0]}, \theta_{[AAA1]}, \ldots, \theta_{[CCC3]})$ is a 32-dimensional vector (8 SNP haplotypes times 4 copy numbers). As $\theta$ are frequencies, the elements are normalized to sum to 1 ($\sum_h \theta_h \equiv 1$), where $h$ denotes every possible haplotype. The diplotype configuration for each individual is assumed to follow the Hardy-Weinberg equilibrium (HWE),

$$P(d_i \mid \theta) \equiv \begin{cases} \theta_{d_{i,1}}^2 & \text{if } d_{i,1} = d_{i,2}, \\ 2\,\theta_{d_{i,1}}\theta_{d_{i,2}} & \text{if } d_{i,1} \neq d_{i,2}, \end{cases} \tag{1}$$

where $d_i \equiv (d_{i,1}, d_{i,2})$ $(i = 1, 2, \ldots, I)$ denotes the pair of haplotypes for the $i$th individual and $\theta_{d_{i,1}}$ denotes the haplotype frequency for haplotype $d_{i,1}$. To prevent permutation symmetry, we impose a constraint, $d_{i,1} \le d_{i,2}$, on $d_i$ by defining a certain order. The probability for each individual is assumed to be independent. Also, measurement of the genotypes is assumed to include no errors:

$$P(\{g_i\}, \{d_i\} \mid \theta) \equiv \prod_{i}^{I} \delta_{d_i \Rightarrow g_i}\, P(d_i \mid \theta), \tag{2}$$

where $g_i$ denotes the observed unphased genotype data for the $i$th individual, $\{g_i\}$ denotes the set of data for all individuals, $I$ denotes the number of individuals in a data set, and $\delta$ denotes an indicator function, which yields 1 when the given condition is true and 0 otherwise. $d_i \Rightarrow g_i$ denotes that a diplotype, $d_i$, is consistent with the genotype data, $g_i$.
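The model above can be made concrete with a short sketch. The following Python code enumerates the diplotypes consistent with a mixed SNP/CNV genotype and evaluates Eq. (1). The function names (`locus_pairs`, `consistent_diplotypes`, `diplotype_prob`) and the representation of a genotype as a list of "A", "C", "B", or an integer copy number are assumptions made for illustration, not part of the original method description.

```python
from itertools import product

def locus_pairs(x):
    """Ordered (chromosome 1, chromosome 2) allele pairs for one locus."""
    if isinstance(x, int):                 # CNV: x is the total copy number
        return [(k, x - k) for k in range(x + 1)]
    if x == "B":                           # heterozygous SNP (A/C)
        return [("A", "C"), ("C", "A")]
    return [(x, x)]                        # homozygous SNP ("A" or "C")

def consistent_diplotypes(g):
    """Unordered haplotype pairs (d1 <= d2) consistent with genotype g."""
    found = set()
    for combo in product(*(locus_pairs(x) for x in g)):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        found.add(tuple(sorted((h1, h2))))  # fix an order to remove the permutation symmetry
    return sorted(found)

def diplotype_prob(d, theta):
    """Eq. (1): HWE probability of diplotype d under haplotype frequencies theta."""
    h1, h2 = d
    if h1 == h2:
        return theta.get(h1, 0.0) ** 2
    return 2.0 * theta.get(h1, 0.0) * theta.get(h2, 0.0)

# The genotype [BBA3] from the text: 2 heterozygous SNPs, 1 homozygous SNP, 1 CNV.
diplotypes = consistent_diplotypes(["B", "B", "A", 3])
```

For [BBA3], the two heterozygous SNP loci and the four ways of splitting three copies give 16 ordered pairs, i.e., 8 unordered diplotypes.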

Our final target is to determine the optimal diplotype configurations, $\{\hat{d}_i\}$, such as $\arg\max_{\{d_i\}} P(\{d_i\} \mid \{g_i\})$ by assuming a certain prior distribution, $P(\theta)$, but this discrete optimization problem is usually hard to solve. Instead, we take a two-step approach; we first determine the maximum likelihood estimator of $\theta$:

$$\hat{\theta} \equiv \arg\max_{\theta} P(\{g_i\} \mid \theta). \tag{3}$$

Then, we infer

$$\{\hat{d}_i\} \equiv \arg\max_{\{d_i\}} P(\{d_i\} \mid \{g_i\}, \hat{\theta}) \tag{4}$$

using the determined $\hat{\theta}$. Equation (4) is easy to solve, but Eq. (3) is not. The EM algorithm is an effective approach to solving this problem; it is an iterative procedure that starts from a randomly chosen $\theta^{(0)}$ and sequentially calculates

$$\theta^{(t+1)} \equiv \arg\max_{\theta} \sum_{\{d_i\}} P(\{d_i\} \mid \{g_i\}, \theta^{(t)}) \ln P(\{g_i\}, \{d_i\} \mid \theta) \tag{5}$$

under the expectation that $\theta^{(t)}$ will converge to the maximum likelihood estimator. The EM algorithm guarantees convergence to a locally optimal solution, but not to a globally optimal one. Therefore, several trials with different initial values are carried out, and the best solution obtained from these is employed. Equation (5) can be reduced to

$$\theta_h^{(t+1)} \propto \sum_{i}^{I} \sum_{d_i \Rightarrow g_i} \theta^{(t)}_{d_{i,1}} \theta^{(t)}_{d_{i,2}} \left( \delta_{d_{i,1} = h} + \delta_{d_{i,2} = h} \right) \tag{6}$$

in this haplotype-inference problem. Thus, the calculation cost is roughly on the order of the total number of possible diplotype configurations satisfying $d_i \Rightarrow g_i$ over all individuals. This calculation cost occasionally becomes intractable because it increases exponentially with the number of heterozygous loci. For example, if the genotype data of a certain individual are [BB...B] with 50 SNP loci, the calculation cost is O( ).

3. Theory

The idea behind the GrEM we previously proposed is simple. Our direct task here is to find one of the optimal solutions, $\hat{\theta}$, which maximizes the following likelihood function under given genotype data $\{g_i\}$,

$$P(\{g_i\} \mid \theta) = \prod_{i}^{I} \sum_{d_i \Rightarrow g_i} P(d_i \mid \theta). \tag{7}$$

By focusing on the symmetrical structure of this function, the GrEM reduces the calculation cost.
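For reference, the baseline EM iteration that maximizes the likelihood in Eq. (7) can be sketched as follows. This is a textbook-style EM over explicitly enumerated diplotypes, written under stated assumptions rather than taken from the paper: the E-step weights each consistent diplotype by its HWE probability normalized within each individual, and the start is uniform instead of random so that the run is reproducible. The function name and the string encoding of haplotypes are hypothetical.

```python
from collections import defaultdict

def em_haplotype_frequencies(diplotype_lists, n_iter=200):
    """Plain EM for haplotype frequencies.

    diplotype_lists[i] is the list of (h1, h2) pairs consistent with the
    genotype of individual i (assumed to be enumerated beforehand).
    """
    haplotypes = sorted({h for ds in diplotype_lists for d in ds for h in d})
    # Assumption: uniform start (the text starts from a random theta).
    theta = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(n_iter):
        counts = defaultdict(float)
        for ds in diplotype_lists:
            # E-step: posterior weight of each diplotype under HWE, Eq. (1).
            weights = [(1.0 if h1 == h2 else 2.0) * theta[h1] * theta[h2]
                       for h1, h2 in ds]
            total = sum(weights)
            for (h1, h2), w in zip(ds, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the 2I chromosomes.
        n_chromosomes = 2.0 * len(diplotype_lists)
        theta = {h: counts[h] / n_chromosomes for h in haplotypes}
    return theta

# Two individuals with genotype [AA] and one with [BB].
data = [[("AA", "AA")], [("AA", "AA")], [("AA", "CC"), ("AC", "CA")]]
theta_hat = em_haplotype_frequencies(data)
```

The inner loop visits every consistent diplotype of every individual, which is where the exponential cost arises: a genotype with n heterozygous SNP loci has 2^(n-1) unordered diplotypes, so 50 heterozygous loci give 2^49, roughly 5.6 x 10^14.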
Here, we show several simple examples composed of only SNP loci, but this theory is also valid for CNV-only data and for mixed SNP and CNV data. For the first example, let us assume one individual has [ABBB], as shown in figure 1(a). In this case, the likelihood function is given as

$$P(\{g_i\} \mid \theta) \propto \theta_{[AAAA]}\theta_{[ACCC]} + \theta_{[AAAC]}\theta_{[ACCA]} + \theta_{[AACA]}\theta_{[ACAC]} + \theta_{[AACC]}\theta_{[ACAA]}. \tag{8}$$

There are four haplotype pairs in this equation, and assigning non-zero values to one of them and zero values to the others clearly produces an optimal solution. Actually,

$\theta_{[AAAA]} = \theta_{[ACCC]} = 1/2$ is one of the optimal solutions (note that the other $\theta$s are zero because $\sum_h \theta_h \equiv 1$). In other words, these four haplotype pairs are equivalent to one another because of the symmetry, and the likelihood is maximized when only one pair has non-zero probabilities. We can thus set the probabilities of three arbitrary haplotype pairs to zero while running the EM algorithm. If there are numerous heterozygous loci, this type of manipulation greatly reduces the calculation cost.

For the second example, let us assume two individuals have [ABBB] and [BAAB], as shown in figure 1(b). In this case, the likelihood function is given as

$$P(\{g_i\} \mid \theta) \propto \left( \theta_{[AAAA]}\theta_{[ACCC]} + \theta_{[AAAC]}\theta_{[ACCA]} + \theta_{[AACA]}\theta_{[ACAC]} + \theta_{[AACC]}\theta_{[ACAA]} \right) \left( \theta_{[AAAA]}\theta_{[CAAC]} + \theta_{[AAAC]}\theta_{[CAAA]} \right). \tag{9}$$

As a result, there are two optimal solutions: the first is $\theta_{[AAAA]} = 1/2$, $\theta_{[ACCC]} = \theta_{[CAAC]} = 1/4$, and the second is $\theta_{[AAAC]} = 1/2$, $\theta_{[ACCA]} = \theta_{[CAAA]} = 1/4$. Because [AAAA] and [AAAC] are consistent with both $g_1$ and $g_2$, they are advantageous. [ACCC], [ACCA], [CAAC], and [CAAA] are also advantageous because each of them can be paired with an advantageous haplotype. As the remaining four haplotypes, [ACAC], [ACAA], [AACC], and [AACA], are less advantageous, we can set their probabilities to zero. This type of manipulation also greatly reduces the calculation cost. As in the first example, the two haplotype combinations { [AAAA], [ACCC], [CAAC] } and { [AAAC], [ACCA], [CAAA] } are symmetrical and equivalent to each other. Thus, we can set the probabilities of one of them to zero.

Such optimality can easily be demonstrated through simple equations. Assume $\theta$ such that every $\theta$ in Eq. (9) is greater than zero. Also, suppose $\theta'$ such that

$$\begin{aligned} \theta'_{[AAAA]} &= \theta_{[AAAA]} + \theta_{[AAAC]} + \theta_{[AACA]} + \theta_{[AACC]}, \\ \theta'_{[ACCC]} &= \theta_{[ACCC]} + \theta_{[ACCA]} + \theta_{[ACAC]} + \theta_{[ACAA]}, \\ \theta'_{[CAAC]} &= \theta_{[CAAC]} + \theta_{[CAAA]}, \\ \theta'_{[AAAC]} &= \theta'_{[AACA]} = \theta'_{[AACC]} = \theta'_{[ACCA]} = \theta'_{[ACAC]} = \theta'_{[ACAA]} = \theta'_{[CAAA]} = 0. \end{aligned} \tag{10}$$

Then, as the likelihood of $\theta$ is always less than that of $\theta'$,

$$P(\{g_i\} \mid \theta) < P(\{g_i\} \mid \theta'), \tag{11}$$

such $\theta$ is necessarily suboptimal. Similarly, we can prove that $\theta$ is suboptimal except for two cases: only $\theta_{[AAAA]}$, $\theta_{[ACCC]}$, and $\theta_{[CAAC]}$ have positive values, or only $\theta_{[AAAC]}$, $\theta_{[ACCA]}$, and $\theta_{[CAAA]}$ have positive values. Therefore, we can assert that the remaining $\theta$s should be exactly zero before running the EM algorithm.

4. Algorithms

Next, how can we perform such manipulations? Doing so manually is not realistic because there are usually hundreds of individuals. Our approach consists of three steps.

(1) The first step determines all borders that correspond to ellipses in the Venn diagram of haplotypes in figure 1. Because a haplotype paired to some advantageous haplotype is also advantageous, not only the borders corresponding to individuals but also new borders are necessary to distinguish these advantageous haplotypes. Intuitively, a new border is a mirror image of a shared haplotype set reflected by some individual. According to our experience, the number of borders usually greatly exceeds that of individuals.

(2) The second step locally determines the most nested territories, i.e., those that correspond to intersections of arbitrary borders but that cannot be separated further by any border. We call such haplotype sets (territories) leaves. Figure 2 shows examples of leaves. The number

of leaves is usually much smaller than $2^N$ when $N$ is large.

(3) The third step detects loopy graphs and replaces some leaves with haplotypes if necessary. Briefly, the leaves and individuals can be expressed using a graph (figure 3). Here, graph vertices and edges correspond to leaves and individuals. The graph is usually disconnected, forming several separated subgraphs. For each subgraph, we check whether it has any odd-step loops such as Möbius strips, and if it has, each vertex (leaf) is replaced with some haplotypes, which are also called leaves. Empirically, such replacements rarely occur but are necessary for an exact maximum likelihood estimator.

Figure 1. Venn diagrams of haplotypes for (a) the first, (b) second, and (c) third examples. $g_1$: [ABBB] denotes multilocus genotype data from the 1st individual. A denotes major and C denotes minor alleles, while B denotes a heterozygous genotype. [ACAC], [ACAA], ... denote haplotypes. The bidirectional arrows denote diplotype configurations for $g_i$. Each ellipse labeled $g_i$ includes all haplotypes this individual may have.

Figure 2. Venn diagrams of leaves after Step 2 of GrEM for (a) the first, (b) second, and (c) third examples. Borders determined by Step 1 are (a) {[ABBB]}, (b) {[ABBB], [BAAB], [ACCB], [CAAB]}, (c) {[ABBB], [BAAB], [BBBB], [BCCB], [CBBB]}. For (b), [AAAB] is temporarily created as a candidate for a border but is dropped because it coincides with an intersection of borders, i.e., [AAAB] = [ABBB] $\cap$ [BAAB]. The same applies to [ACCB] and [CAAB] for (c).

GrEM is mainly composed of these three steps of manipulation. After these steps, we run the EM algorithm not for all possible haplotypes (haplotypes such as those in figure 1) but for leaves (leaves such as those in figure 4). Although the number of all possible haplotypes may not be tractable, we expect that the number of leaves will be much

smaller and tractable. More precisely, the calculation cost of the EM algorithm is in proportion to the number of possible diplotypes, as shown in Eq. (6); we evaluate the proposed method by comparing the numbers of possible diplotypes before and after these steps.

Figure 3. Graphical expression of leaves after Step 2 of GrEM for (a) the first, (b) second, and (c) third examples. [ABBB], [ACCB], ... denote vertices of leaves. $g_i$ denotes the edge of the $i$th individual. In each example, the number of subgraphs is only one by chance, but there are usually several disconnected subgraphs for larger data sets. For (a), there is a 1-step loop: [ABBB]-$g_1$-[ABBB], which is expanded with Step 3 of GrEM. For (b), there are no loops. For (c), there is a 3-step loop: [ACCB]-$g_1$-[AAAB]-$g_2$-[CAAB]-$g_3$-[ACCB], which is also expanded with Step 3.

Figure 4. Graphical expression of leaves after Step 3 of GrEM for (a) the first, (b) second, and (c) third examples. For (a) and (c), the original leaves are replaced by new leaves with the loop expansion in Step 3, while loop expansion did not occur in (b). For the graph in figure 3(c), we can start from [ACCB], go around through [AAAB] and [CAAB], and return to [ACCB]. However, if we check the haplotype connectivity more closely, this path is in fact impossible. As shown in (c), we start from [ACCA], go around through [AAAC] and [CAAA], and cannot return to [ACCA]. We have to go further through [ACCC], [AAAA], and [CAAC] to reach the starting haplotype. We can find such a Möbius strip and expand it with Step 3 of GrEM. The calculation cost of the following EM algorithm is roughly in proportion to the number of diplotypes, which are indicated by the bidirectional arrows here.

Before going into details on the algorithms for the GrEM, let us define some operators, as listed in table 1. Again, $h$ denotes a single haplotype, e.g., [AACC].
$g$ denotes multilocus genotype data from an individual, and we simply call $g$ an individual; e.g., [ABCB] means this individual may have [AACA] and [ACCC] or [AACC] and [ACCA]. $b$ denotes a border; e.g., [ABCB] means the set of haplotypes { [AACA], [AACC], [ACCA], [ACCC] }. The set of haplotypes a border can express is limited to the direct product of the genotypes at each locus.
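The direct-product structure of a border can be illustrated with a minimal Python sketch for SNP data. The helper name `expand_border` and the string encoding of borders are assumptions made for illustration; the point is that a border is stored with one symbol per locus while the haplotype set it denotes grows as 2 to the power of the number of B symbols.

```python
from itertools import product

# Alleles allowed at one locus of a border: B stands for "A or C".
LOCUS_ALLELES = {"A": "A", "C": "C", "B": "AC"}

def expand_border(border):
    """Return the set of haplotypes denoted by an SNP border such as 'ABCB'."""
    return {"".join(h) for h in product(*(LOCUS_ALLELES[x] for x in border))}

haplotypes = expand_border("ABCB")
```

For [ABCB] this gives the four haplotypes listed in the text, whereas [BBBB] would already expand to 16.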

Table 1. Operation tables of $\oplus$, $\bar{\cdot}$, $\cap$, and $\ominus$ for SNP data. $b$ denotes a border, and $g$ denotes genotype data from a certain individual. All of these operations are performed locus-wise according to the tables. Superscript $(n)$ denotes the identity of a locus. A denotes major and C denotes minor alleles, while B denotes a heterozygous genotype and Z denotes a missing genotype, which may be A, C, or B.

$g := h \oplus h'$
  $h^{(n)}$ \ $h'^{(n)}$    A    C
  A                         A    B
  C                         B    C

$b := \bar{g}$
  $g^{(n)}$    A    C    B    Z
  $b^{(n)}$    A    C    B    B

$b'' := b \cap b'$
  $b^{(n)}$ \ $b'^{(n)}$    A    C    B
  A                         A    -    A
  C                         -    C    C
  B                         A    C    B

$b' := g \ominus b$
  $g^{(n)}$ \ $b^{(n)}$     A    C    B
  A                         A    -    A
  C                         -    C    C
  B                         C    A    B
  Z                         B    B    B

Algorithm 1 DetermineBorders($G$)
1: initialize $B := \phi$, $C := \phi$, and $D := \phi$
2: foreach $g \in G$, calculate $b := \bar{g}$, and add $b$ to $C$ and $D$
3: foreach $g \in G$, if $g$ has homozygous $b$ then add $b$ to $C$ and $D$
4: if $C = \phi$ then return $B$
5: remove largest $b$ from $C$
6: if $b = b' \cap b'' \cap b''' \cap \ldots$ ($b', b'', b''', \ldots \in B$) then goto 4
7: foreach $g \in G$ and $\bar{g} \cap b \neq \phi$, calculate $b' := g \ominus b$, if $b' \notin D$ then add $b'$ to $C$ and $D$
8: add $b$ to $B$
9: goto 4

We also define operation tables for these items as shown in table 1. $\oplus$ is a binary operator that produces the observed result, $g$, from two given haplotypes, e.g., [AACC] $\oplus$ [ACAC] is [ABBC]. $\bar{\cdot}$ is a unary operator that produces a border from a given $g$, e.g., $\overline{[ABCA]}$ is [ABCA]. For SNP data without missing values, the result is the same as the operand, but it differs for missing values of SNP data and for CNV data. We will explain the CNV case later. $\cap$ is a binary operator that produces the intersection of two given borders, e.g., [ABBC] $\cap$ [BABB] is [AABC]. $\ominus$ is a binary operator that produces the set of haplotypes paired to a given $b$ for $g$, e.g., [ABBB] $\ominus$ [AAAB] is [ACCB]. Note that if $b$ includes haplotypes that cannot be paired for $g$ when $g \ominus b$ is calculated, such haplotypes are simply neglected, e.g., [ABBB] $\ominus$ [BAAB] is also [ACCB] (refer to figure 2(b)).

The first step of the GrEM is given as algorithm 1. This is a function whose argument is $G$, the set of all individuals, and it returns $B$, the set of determined borders.
This function intuitively produces mirror images sequentially. $\phi$ denotes the empty set. $C$ denotes the set of candidates for borders that are worth keeping. $D$ denotes the set of processed borders, which prevents a border from being added to $C$ more than once on line 7. On line 2, $\bar{g}$ denotes the ellipse in the Venn diagram corresponding to each individual $g_i$. Homozygous $b$ on line 3 will be explained later in connection with CNV data. The largest $b$ on line 5 denotes the border with the largest number of Bs in its sequence, e.g., the first is the largest of [ABBB], [ABCB], and [AACB], because it has three Bs while the others have fewer than three. Line 6 is a cost saver: if a new candidate coincides with an intersection of an arbitrary number of already determined borders, such a candidate is not worth keeping. Line 7 represents the main operation of this function: a mirror image of a border for each individual is added to $C$. These manipulations are repeated until $C$ has no elements.
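The locus-wise operations of table 1 and the candidate loop of algorithm 1 can be sketched in Python for SNP data. This is an illustrative reading, not the authors' implementation. It makes several assumptions: borders and genotypes are strings over A, C, B (and Z); the condition on line 7 is taken to be "the ellipse of g overlaps b", which reproduces the border sets quoted in the caption of figure 2; line 3 (homozygous borders) is omitted, since for SNP genotypes without missing values a genotype that has a homozygous haplotype contains no B and its border is already that haplotype; ties for the "largest" border are broken alphabetically. All function names are hypothetical.

```python
PAIR = {("A", "A"): "A", ("A", "C"): "B", ("C", "A"): "B", ("C", "C"): "C"}
BORDER = {"A": "A", "C": "C", "B": "B", "Z": "B"}
# Missing keys mean "empty" (the dashes in table 1).
INTERSECT = {("A", "A"): "A", ("A", "B"): "A", ("C", "C"): "C", ("C", "B"): "C",
             ("B", "A"): "A", ("B", "C"): "C", ("B", "B"): "B"}
MIRROR = {("A", "A"): "A", ("A", "B"): "A", ("C", "C"): "C", ("C", "B"): "C",
          ("B", "A"): "C", ("B", "C"): "A", ("B", "B"): "B",
          ("Z", "A"): "B", ("Z", "C"): "B", ("Z", "B"): "B"}   # keys are (g, b)

def locuswise(table, x, y):
    """Apply a binary table locus by locus; None if any locus is empty."""
    out = [table.get(pair) for pair in zip(x, y)]
    return None if None in out else "".join(out)

def pair(h1, h2): return locuswise(PAIR, h1, h2)
def border(g): return "".join(BORDER[x] for x in g)
def intersect(b1, b2): return locuswise(INTERSECT, b1, b2)
def mirror(g, b): return locuswise(MIRROR, g, b)

def determine_borders(genotypes):
    borders, candidates, seen = [], set(), set()          # B, C, D
    for g in genotypes:                                   # line 2
        candidates.add(border(g)); seen.add(border(g))
    while candidates:                                     # line 4
        b = max(sorted(candidates), key=lambda c: c.count("B"))   # line 5
        candidates.remove(b)
        # Line 6 (cost saver): b is an intersection of known borders exactly
        # when it equals the intersection of all known borders containing it.
        supersets = [x for x in borders if intersect(x, b) == b]
        if supersets:
            meet = supersets[0]
            for x in supersets[1:]:
                meet = intersect(meet, x)
            if meet == b:
                continue
        for g in genotypes:                               # line 7 (assumed overlap test)
            m = mirror(g, b)                              # None when g and b do not overlap
            if m is not None and m not in seen:
                candidates.add(m); seen.add(m)
        borders.append(b)                                 # line 8
    return borders

borders_b = determine_borders(["ABBB", "BAAB"])
```

With the two individuals of the second example, the sketch creates [AAAB] as a candidate and drops it on line 6, as described in the caption of figure 2(b).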

Algorithm 2 DetermineLeaves($B$)
1: initialize $L := \phi$ and $B' := \phi$
2: if $B = \phi$ then return $L$
3: remove largest $b$ from $B$
4: foreach $b' \in B$, if $b' \subseteq b$ then remove $b'$ from $B$, and add $b'$ to $B'$
5: foreach $b' \in B$ and $b' \cap b \neq \phi$, calculate $b'' := b' \cap b$, and add $b''$ to $B'$
6: if $B' = \phi$ then add $b$ to $L$ else add (call DetermineLeaves($B'$)) to $L$
7: goto 2

Figure 5. Example of a tree of borders. A tree is usually more powerful than a set when we check whether a certain border is already included or not. If we introduce a certain order for siblings in a tree, it will be more efficient.

The second step of the GrEM is given as algorithm 2. This is a recursive function whose argument is $B$, the result of the first step, and it returns $L$, the set of determined leaves. Leaves are defined using the set of borders. A leaf must coincide with either some border or the intersection of an arbitrary number of borders, and it must not be separated by any borders. Recall that shared haplotypes and their mirror images have advantages. Leaves coincide with such advantageous haplotype sets. If line 6 in algorithm 1, the cost saver, is removed, the resulting $B$ necessarily includes all leaves because applying the operator $\ominus$ twice is equivalent to the simple intersection, $g \ominus (g \ominus b) = \bar{g} \cap b$ (except for missing values). The approach without the cost saver seems more straightforward, but it usually produces so many borders that memory overflows. On line 3, one of the borders is selected, and the intersection is only calculated inside this border. Such a step is hierarchically repeated on line 6. This approach suppresses the peak number of borders and results in a faster calculation. When all the intersections are checked, this function returns the set of all determined leaves.

These two algorithms can be made more efficient if we introduce a tree of borders as shown in figure 5. The construction of this tree is intuitive.
Borders are added to the tree in descending order of their cardinalities. We start from the root and check all its children to find whether any child is a proper superset of the new border. If there is such a child, we move to that child and check all its children again. We repeat this check until there are no such children. Then, we add the new border to the current border as a child. We also construct such a tree for $\{\bar{g}_i\}$. This is because these two algorithms contain numerous inclusion checks, e.g., the check between $\bar{g}$ and $b$ on line 7 of algorithm 1, or $b' \subseteq b$ on line 4 of algorithm 2. Also, the tree of borders is efficient for checking the if-clause on line 6 of algorithm 1.

The third step of the GrEM is as follows. First, we group all leaves into several disconnected subgraphs. Each subgraph is checked as to whether it has any odd-step loops or not. More specifically, we arbitrarily select a leaf, $l$, from the subgraph and assign a specific haplotype, $h$, to $l$, where $h \in l$. Then, we propagate $h$ to the whole subgraph; we find the vertex $l'$ adjacent to $l$ by following edge $g$, where $l' = g \ominus l$, and assign $h' := g \ominus h$ to $l'$. If $l'$ has already been assigned the same haplotype, we do not assign it twice, whereas when $h'$ is different from the assigned haplotype, we assign $h'$ to $l'$ in addition to the already assigned one(s). This assignment is repeated throughout the whole subgraph. Finally, we check whether every leaf in the subgraph is only assigned one haplotype or not. If so, we do not have to do anything else. If not, we replace all leaves in the subgraph with the assigned haplotypes, which become new leaves instead.
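The propagation described above can be sketched in Python for SNP data without missing values. This is a hedged illustration with hypothetical function names. It assumes that the leaves of one connected subgraph are already known (here they are taken from the caption of figure 3), that the third example consists of the individuals [ABBB], [BAAB], and [BBBB], and that the starting haplotype is obtained by choosing A at every B locus of the first leaf in alphabetical order.

```python
# Mirror table of table 1 for SNP data without missing values; keys are (g, b).
MIRROR = {("A", "A"): "A", ("A", "B"): "A", ("C", "C"): "C", ("C", "B"): "C",
          ("B", "A"): "C", ("B", "C"): "A", ("B", "B"): "B"}

def mirror(g, b):
    """Locus-wise mirror image of b for g; None if some locus is empty."""
    out = [MIRROR.get(pair) for pair in zip(g, b)]
    return None if None in out else "".join(out)

def propagate(leaves, genotypes):
    """Assign haplotypes to every leaf of one connected subgraph."""
    leaves = set(leaves)
    start = sorted(leaves)[0]
    h0 = start.replace("B", "A")          # an arbitrary member of the start leaf
    assigned = {leaf: set() for leaf in leaves}
    assigned[start].add(h0)
    stack = [(start, h0)]
    while stack:
        leaf, h = stack.pop()
        for g in genotypes:               # follow every edge g from this leaf
            next_leaf = mirror(g, leaf)
            next_h = mirror(g, h)
            if next_leaf in leaves and next_h not in assigned[next_leaf]:
                assigned[next_leaf].add(next_h)
                stack.append((next_leaf, next_h))
    return assigned

def expand_loops(leaves, genotypes):
    """Replace the leaves by haplotypes if some leaf received two or more."""
    assigned = propagate(leaves, genotypes)
    if all(len(hs) == 1 for hs in assigned.values()):
        return set(leaves)                # no odd-step loop: keep the leaves
    return set().union(*assigned.values())

leaves_c = expand_loops(["ACCB", "AAAB", "CAAB"], ["ABBB", "BAAB", "BBBB"])
```

For the third example the propagation returns to [ACCB] with a different haplotype, so the three leaves are replaced by six haplotypes, whereas for the second example every leaf keeps a single haplotype and nothing is replaced.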

This algorithm successfully detects odd-step loops in the subgraph. Intuitively, this step detects Möbius strips and expands them. Three examples before this step are shown in figure 3, and those after this step are shown in figure 4. Through Steps 1 and 2, we obtain advantageous sets of haplotypes, but their connectivities sometimes form Möbius strips. Note that if all the cardinalities of leaves are one in a subgraph, this replacement never occurs. Empirically, this replacement rarely occurs. Also, this step does not take a lot of time.

We will only give a short explanation here of missing values of SNP data. The missing value denoted by Z may be A, C, or B. The proposed approach successfully manages missing values. More specifically, missing values are dealt with in the same manner as the other genotypes, as listed in table 1. Empirically, missing values greatly increase the number of borders and leaves. Thus, the number of missing values is critical to the calculation cost of the proposed method.

Table 2. Operation tables of $\oplus$, $\bar{\cdot}$, $\cap$, and $\ominus$ for CNV data. $b$ denotes a border and $g$ denotes genotype data from a certain individual. All these operations are performed locus-wise according to the tables. Superscript $(n)$ denotes the identity of a locus. A border has two values, i.e., minimum and maximum numbers of copies, for each locus, which is expressed as 0..x when the minimum and maximum numbers correspond to 0 and x.

$g := h \oplus h'$
  $h^{(n)} = x$, $h'^{(n)} = y$    gives    $g^{(n)} = x + y$

$b := \bar{g}$
  $g^{(n)} = x$    gives    $b^{(n)} = 0..x$

$b'' := b \cap b'$
  $b^{(n)} = x..y$, $b'^{(n)} = w..z$    gives    $b''^{(n)} = \max(w, x)..\min(z, y)$

$b' := g \ominus b$
  $b^{(n)} = x..y$, $g^{(n)} = z$    gives    $b'^{(n)} = \max(0, z - y)..\max(0, z - x)$

Each copy number for each polymorphic locus of a certain individual appears as an arithmetical addition of two copy numbers from each haplotype. In our approach, SNP and CNV have two common properties. These properties are related to the introduced operators, $\bar{\cdot}$, $\ominus$, and $\cap$. Operator $\bar{\cdot}$ produces a set of haplotypes that a given individual may have.
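The locus-wise CNV operations of table 2 can be sketched in a few lines of Python. The function names and the representation of a border locus as a (minimum, maximum) tuple are assumptions for illustration. The mirror operation follows the table literally, which presumes that the border lies inside the set of haplotypes the individual may have (maximum not exceeding the observed total).

```python
def pair_cnv(x, y):
    """Observed copy number from the copy numbers of two haplotypes."""
    return x + y

def border_cnv(x):
    """Copy numbers a haplotype may have when the observed total is x: 0..x."""
    return (0, x)

def intersect_cnv(b1, b2):
    """Intersection of two ranges; None if they do not overlap."""
    lo, hi = max(b1[0], b2[0]), min(b1[1], b2[1])
    return (lo, hi) if lo <= hi else None

def mirror_cnv(z, b):
    """Range paired with range b = x..y for observed total z (table 2)."""
    x, y = b
    return (max(0, z - y), max(0, z - x))

def apply_locuswise(op, left, right):
    """Apply a binary CNV operation independently at every locus."""
    return [op(a, b) for a, b in zip(left, right)]

border = [border_cnv(x) for x in [0, 1, 2, 3]]
mirrored = apply_locuswise(mirror_cnv, [1, 2, 3, 4], [(0, 2), (1, 2), (0, 1), (2, 3)])
```

Each operation touches one locus at a time and stores two integers per locus, so both the memory and the time per operation grow linearly with the number of loci.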
We can define this operator for CNV as listed in table 2, e.g., $\overline{[0, 1, 2, 3]}$ is [0..0, 0..1, 0..2, 0..3], where x..y denotes the set of integers from x through y. Similar to SNP data, the set of haplotypes corresponding to a certain individual can also be expressed as the direct product of each locus. This operation can also be calculated with a locus-wise process. Although these properties are trivial, they are important for the proposed method. Operator $\ominus$ produces a set of haplotypes that can be paired with shared haplotypes. We can also define this operator for CNV as seen in table 2, e.g., [1, 2, 3, 4] $\ominus$ [0..2, 1..2, 0..1, 2..3] is [0..1, 0..1, 2..3, 1..2]. An operation result can also be expressed as a direct product of each locus, and this operation can be performed by using a locus-wise process. Operator $\cap$ produces new borders that coincide with the intersection of two given borders. This operator also retains these properties.

The property of the direct product guarantees that a border can be expressed using O(N) memory, where N denotes the number of polymorphic loci. The property of the locus-wise process guarantees that each operation can be performed with a calculation cost of O(N) if the process for a single locus requires a calculation cost of O(1). Without these properties, we would incur a memory requirement and calculation cost of O($2^N$), which is intractable. These properties also make haplotype inference from mixed SNP and CNV data possible. For example, we can manage such individual data as [A, B, C, 1, 2, 3].

For CNV, the set of copy numbers at each locus of a border is convex, and we can express it in the form of x..y by using two integers, i.e., minimum and maximum numbers. Although this property is not crucial, it is advantageous. Without this convexity, we would have to hold a set

of copy numbers for each locus, e.g., [{0, 1, 3}, {2}, {2, 3}, {0, 2, 4}]. If the average cardinality of each set was five, the memory requirement and calculation cost would increase five times.

One significant difference in the properties of SNP and CNV concerns the set of haplotypes a certain individual may have, $\bar{g}$; each haplotype in the set is symmetrical for SNP (figure 1(a)) but not for CNV. Let us consider an individual of $g$ = [2, 2]. $\bar{g}$ is {[0,0], [0,1], [0,2], [1,0], [1,1], [1,2], [2,0], [2,1], [2,2]}. Of the elements in this set, only [1,1] can be a homozygous haplotype (figure 6(a)). This type of homozygous haplotype exists if $g$ does not contain any heterozygous genotype B or odd copy numbers, e.g., [A, C, 2, 4]. The expression "$g$ has homozygous $b$" in algorithm 1 refers to such a situation. Except for this, CNV data or mixed SNP and CNV data can be dealt with by using the same algorithms. Figure 6 shows one such example.

Figure 6. Example of CNV data. (a) Venn diagram of haplotypes. (b) Venn diagram of leaves. (c) Graphical expression of leaves after Step 2. (d) Graphical expression of leaves after Step 3.

Figure 7. EM and GrEM results for the number of diplotypes. Each closed circle denotes the result from an artificial data set: 100 individuals of 1-30 SNP loci including 1% missing values. GrEM successfully reduced the number of diplotypes. EM could not handle 7 out of 150 data sets because of memory overflow. Such data sets are plotted on the extreme right of the graph.

5. Results

We first show how much GrEM reduces the calculation cost by using artificial data sets. Figure 7 compares the numbers of diplotypes, which are roughly in proportion to the calculation cost of the EM algorithm. Memory overflow often occurred with the original EM when there were more than 28 SNP loci. However, GrEM successfully reduced the number of diplotypes when there were more than 12 SNP loci.
The effect of the cost saver on line 6 in algorithm 1 is plotted in figure 8. The cost saver reduced the number of borders by about 90% (figure 8(a)), which roughly corresponds to the calculation cost and memory requirement of Step 1. Also, fewer borders reduced the calculation time of

Step 2, as can be seen from figure 8(b). This effect was especially clear when there were many borders.

Figure 8. Effect of cost saver on line 6 in algorithm 1. The data sets are the same as those used in figure 7. The cost saver reduced the number of borders by about 90% and thus helped to prevent memory overflow.

The accuracy of GrEM was tested and validated using real data sets provided by Kato et al. [13]. They were measured by quantitative polymerase chain reaction (PCR) on CYP2D6 and MRGPRX1 genes for individuals of Northern and Western European descent from Utah, USA (CEU) and for individuals of the Yoruba people from Ibadan, Nigeria (YRI) in the HapMap populations. We also prepared mixed data for these experimental CNVs and neighboring SNPs. As shown in table 3, GrEM worked well on both small and large data sets. In regard to the small data sets, the log-likelihoods of the three methods were the same. In regard to the large data sets, memory overflow occurred in the original EM algorithm and the conventional mixture-of-CNV-SNP phaser (MOCSphaser) [13]. These results demonstrated that the proposed method could handle SNP-CNV data and greatly reduce the calculation cost.

Table 3. Results for real data sets. Loci denotes the number of polymorphic loci, where 1 means 1 CNV locus, while 12 means 1 CNV and 11 SNP loci. Log-likelihood denotes $\max_{\theta} \ln P(\{\hat{d}_i\} \mid \{g_i\}, \theta)$. Intractable cases are marked in the table.
Columns: dataset; loci; number of individuals; missing rate (%); log-likelihood (GrEM, EM, MOCSphaser).
Rows: MRGPRX1 (CEU); MRGPRX1 (YRI); CYP2D6 (CEU); CYP2D6 (YRI); MRGPRX1 (CEU)+SNP; MRGPRX1 (YRI)+SNP; CYP2D6 (CEU)+SNP; CYP2D6 (YRI)+SNP.

Concerning the calculation time, GrEM was much faster than MOCSphaser. We prepared 7 middle-sized data sets composed of one CNV locus and 9-14 SNP loci, for which both GrEM and MOCSphaser could perform the inference. GrEM was faster than MOCSphaser on these data sets; in the median case, the calculation times were 2.5 sec and 28 sec, respectively, although calculation time greatly depends

on the given data.

6. Discussion

The proposed method assumes Hardy-Weinberg equilibrium (HWE) for the generative model. Also, it does not employ recombination or single mutation for the model or the prior distribution. If the population of given data is large, this simple model is considered to be adequate, and it avoids overfitting to the given data. The EM algorithm is a standard method that is used to solve this model. The EM algorithm has a serious problem in that its calculation cost increases exponentially. The proposed method focuses on the symmetrical structure of the model and successfully reduces the calculation cost by using such symmetry. If the population is small or, more extremely, if pedigree data are given, the HWE model is not an adequate way of representing the data. One useful extension of the proposed method would be the ability to impose a blood relationship on the model.

Missing values create realistic problems for haplotype inference. The proposed method can manage missing values for SNP data. More precisely, completely missing values are acceptable, but partially missing values are not, e.g., a case where, for a certain SNP locus, one of the alleles is known to be major and the other allele is unknown. This is because such data are asymmetric and GrEM does not work well with them. The proposed method has currently not been implemented to manage CNV data with ambiguously determined copy numbers. We would like to extend it to handle such data.

Single-nucleotide variation in a copy unit (SNVC) is also a big problem for haplotype inference [14]. As CNV involves multiple copies of DNA segments longer than 1 kilobase, and the frequency of SNP is about 0.1%, the CNV region usually contains multiple SNP loci. These SNP data cannot be distinguished from one another, but the quantity of each allele can be measured. We would also like to extend the method to SNVC data in the future.
Acknowledgments

This work was partially supported by a Grant-in-Aid for Scientific Research on Priority Areas (No ) and by the High-Tech Research Center Project of the Japanese Ministry of Education, Culture, Sports, Science and Technology.

References

[1] The Wellcome Trust Case Control Consortium 2007 Nature
[2] Redon R et al 2006 Nature
[3] Locke D P et al 2006 Am. J. Hum. Genet.
[4] McCarroll S et al 2008 Hum. Mol. Genet. 17 R
[5] Clark A G 1990 Mol. Biol. Evol.
[6] Excoffier L and Slatkin M 1995 Mol. Biol. Evol.
[7] Stephens M, Smith N J and Donnelly P 2001 Am. J. Hum. Genet.
[8] Niu T H, Qin Z H S, Xu X P and Liu J S 2002 Am. J. Hum. Genet.
[9] Qin Z H S, Niu T H and Liu J S 2002 Am. J. Hum. Genet.
[10] Stephens M and Donnelly P 2003 Am. J. Hum. Genet.
[11] Xing E P, Jordan M I and Sharan R 2007 J. Comput. Biol.
[12] Shindo H, Chigira H, Tanaka J, Kamatani N and Inoue M 2008 J. Hum. Genet.
[13] Kato M, Nakamura Y and Tsunoda T 2008 Bioinformatics
[14] Kato M, Nakamura Y and Tsunoda T 2008 Am. J. Hum. Genet.


More information

Aston Hall s A-Z of mathematical terms

Aston Hall s A-Z of mathematical terms Aston Hall s A-Z of mathematical terms The following guide is a glossary of mathematical terms, covering the concepts children are taught in FS2, KS1 and KS2. This may be useful to clear up any homework

More information

Superconcentrators of depth 2 and 3; odd levels help (rarely)

Superconcentrators of depth 2 and 3; odd levels help (rarely) Superconcentrators of depth 2 and 3; odd levels help (rarely) Noga Alon Bellcore, Morristown, NJ, 07960, USA and Department of Mathematics Raymond and Beverly Sackler Faculty of Exact Sciences Tel Aviv

More information

Edge and local feature detection - 2. Importance of edge detection in computer vision

Edge and local feature detection - 2. Importance of edge detection in computer vision Edge and local feature detection Gradient based edge detection Edge detection by function fitting Second derivative edge detectors Edge linking and the construction of the chain graph Edge and local feature

More information