Statistical relationship discovery in SNP data using Bayesian networks


Paweł Szlendak and Robert M. Nowak
Institute of Electronic Systems, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland

ABSTRACT

The aim of this article is to present an application of Bayesian networks to the discovery of affinity relationships based on genetic data. The presented solution uses a search-and-score algorithm to discover the Bayesian network structure which best fits the data, i.e. the alleles of single nucleotide polymorphisms detected by DNA microarrays. The algorithm treats structure learning as a combinatorial optimization problem. It is a randomized local search algorithm which uses a Bayesian-Dirichlet scoring function. The algorithm's testing procedure encompasses tests on synthetic data, generated from given Bayesian networks by a forward sampling procedure, as well as tests on real-world genetic data. The comparison of the Bayesian networks generated by the application with the genetic evidence data confirms the usability of the presented methods.

Keywords: bioinformatics, Bayesian network, single nucleotide polymorphism

1. INTRODUCTION

Bayesian networks belong to the family of graphical models, which combine probability theory and graph theory. They model statistical relationships of attributes in data (perceived as random variables) using a directed acyclic graph (DAG) to visualize the relationships. Data (we considered discrete data with no missing values and no latent variables) is seen as a finite set of random variables X = {X_1, X_2, ..., X_n}, where each X_i may take on a value x_i from a finite and discrete domain. Formally, a Bayesian network is defined as a pair (G, P), where G is a DAG and P is a joint probability distribution of the random variables from X. Nodes in G represent the variables of X and edges express the relationships between these random variables.
Depending on the domain being modeled, the direction of a particular arc can sometimes be interpreted as a causal relationship. A Bayesian network satisfies the Markov condition, which says that a node in a Bayesian network is independent of its non-descendants given its parents. The Markov condition enables the factorization of the joint probability distribution of the random variables in the form given by equation (1):

\[
P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid PA_{X_i}) \qquad (1)
\]

Thus, the joint probability distribution is expressed as a product of the conditional distributions of all nodes given the values of their parents in G. To specify (1) it is required to determine each conditional probability of X_i given the set of its parents. When X_i takes on discrete values, the conditional probabilities P(X_i | PA_{X_i}) are represented in the form of conditional probability tables (CPTs), which consist of the conditional distributions P(x_i | pa_{X_i}) for each possible value x_i of X_i and each possible configuration pa_{X_i} of the values of PA_{X_i}. Owing to the factorization in equation (1), the complexity of data analysis is significantly reduced.

(Further author information: Paweł Szlendak: P.Szlendak@stud.elka.pw.edu.pl, Robert M. Nowak: R.M.Nowak@elka.pw.edu.pl)

The joint probability distribution P can be equally well expressed by more than one DAG; therefore (Markov) equivalence classes of DAGs are defined. Two DAGs belong to the same Markov equivalence class if and only if

they have the same skeletons (edges with directions disregarded) and the same set of v-structures, that is, pairs of converging arrows whose tails are not connected by an arrow.[2] A Markov equivalence class can be represented as a graph; such a representation is called a PDAG (pattern DAG).

The problem tackled in this paper was to discover the Bayesian network (more precisely, the PDAG) which best fits the given genetic data. The nodes in the network represent individuals and the arcs express relationships which, being statistically important, might be interesting from the point of view of medicine, forensic science or biology. The analyzed genetic data refers to variations in the DNA sequence of the human genome. The genetic information is stored in DNA molecules. A DNA molecule is a chain built from 4 nucleotides: adenine (A), cytosine (C), guanine (G) and thymine (T), of a length depending on the organism; e.g. the human genome is about 3×10^9 nucleotides long. About 99% of the DNA sequence is identical among individuals of a given species. The most frequent variation of DNA is the single nucleotide polymorphism (SNP), which occurs when a single nucleotide in the genome sequence is altered. The number of SNPs in the human genome is estimated to be about 3×10^6,[3] and these variations are evolutionarily stable (not changing much from generation to generation); thus they are useful in medicine and forensic science. To detect SNPs, microarray technology is used, capable of identifying on the order of 10^4 variants in one step.

Genetic data usually taken into consideration in forensic science or medicine includes those parts of the DNA code that vary among individuals. A place in a DNA sequence where a particular piece of genetic information is stored is called a locus; the DNA string stored at a locus is called a gene. The different substrings of DNA that may occur at a given locus are called variants or alleles. In diploid organisms (e.g. humans) the majority of genes come in two copies (one inherited from the female and one from the male parent).
Growing availability of genetic data, owing to technological development, has caused many genetic problems to be expressed in terms of graphical models, since graphical models are a natural way to express the relationships between individuals.[4] The problems range from identifying genes causing particular diseases, through discovering affinity relationships within a group of individuals, to examining the cell life cycle. Figure 1 shows a pedigree diagram commonly used in genetics and a DAG expressing the same relationships.

Figure 1. Traditional pedigree diagram vs. pedigree DAG. Females are represented by circles, males by squares. The letters indicate particular alleles of an individual at a given locus.

The problem which extends the traditional assessment of alleles given parent-child relationships is that of establishing unknown affinity relationships for a group of individuals based on SNPs, e.g. identifying multiple remains from disasters or wars. Specialized algorithms have been developed for the analysis of genetic linkage with graphical models.[5,6] The algorithm presented in this paper perceives structure learning of a Bayesian network as a combinatorial optimization problem. It exploits the decomposability of the scoring function (whose evaluation cost in the presented solution is constant) and thus allows the data to be analyzed more efficiently.

2. PROPOSED ALGORITHMS

The algorithms that have been proposed, in order to reliably assess the usability of Bayesian networks in the discovery of statistical relationships, address the problems of sampling, structure learning of Bayesian networks and transforming DAGs to PDAGs.

The problem of sampling a Bayesian network is the same as the problem of obtaining a sample from a given distribution. In this case, the distribution is represented in the form of a DAG and is factorized according to equation (1). Random variables in a Bayesian network depend directly on their parents. Thus, in order to determine the value of a random variable, say X_i, it is necessary to obtain the values of the random variables belonging to the parents PA_{X_i} of X_i. In the presented approach the forward sampling algorithm[7] was used, which starts from a root node and traverses the Bayesian network in breadth-first order, sampling each node according to the node's CPT.

The proposed structure learning algorithm follows the search-and-score approach to learning Bayesian network topology. Search-and-score methods perceive structure learning as a combinatorial optimization problem, where a space of candidate DAGs (over n variables) is searched for the DAG which best approximates the joint probability distribution. In the presented approach the algorithm designed for structure learning, given in Listing 1, is an improved version of a randomized greedy local search algorithm[8] and was inspired by the solution presented in [9]. The algorithm searches the space of all DAGs (containing n variables) for the DAG with the highest value of the scoring function.
Listing 1. Randomized local DAG search.
Problem: find a DAG G that maximizes the score.
Input: data D, graph G (possibly empty), number of local search runs l.
Output: DAG G that maximizes the score.

 1: RandomizedLocalDAGSearch(D, G, l)
 2:   score_best ← score(D, G)
 3:   G_best ← G
 4:   for i ← 1 to l do
 5:     G ← RandomDAG(getVertices(G))
 6:     score ← score(D, G)
 7:     repeat
 8:       findMore ← false
 9:       N_G ← generateAcyclicNeighbourhood(G)
10:       for all G_j from N_G do
11:         if score < score(D, G_j) then
12:           G ← G_j
13:           score ← score(D, G)
14:           findMore ← true
15:         end if
16:       end for
17:     until findMore = false
18:     if score_best < score then
19:       G_best ← G
20:       score_best ← score
21:     end if
22:   end for
23:   return G_best

The algorithm performs l local searches (lines 4-22), every time starting from a random DAG. At each step of a local search, a neighborhood of the current DAG G is found by the procedure generateAcyclicNeighbourhood. The neighborhood of a DAG is the set of all DAGs that are obtained from graph G by applying one of the following operations:

- if two nodes are not adjacent, add an edge between them in either direction, provided that no cycle is introduced;
- if two nodes are adjacent, remove the edge between them;

- if two nodes are adjacent, reverse the edge between them, provided that no cycle is introduced.

Next, all DAGs in the neighborhood of the current DAG G are scored and the one with the highest score is selected (line 10). This graph becomes the current DAG G, and the local search proceeds further with G as the new current DAG. The local search is stopped when no graph from the neighborhood of G has a higher score than G. When this is the case, the algorithm checks whether the graph obtained from the local search has a higher score than the best graph found so far. If its score is better, then that graph becomes the best graph G_best (line 19). The algorithm proceeds to the next iteration and a new local search begins, starting from a random DAG. The algorithm stops after the specified number of l iterations.

Randomization of the initial graph for each local search is necessary to avoid local maxima. Starting always from one graph, for example the empty graph, would always lead to the same output graph, since the local search is deterministic. Randomization, however, enables the local search to explore the complete DAG space as long as l is high enough.

The fitness of a DAG to the data is measured with a scoring function. The score used for ranking DAGs is a Bayesian-Dirichlet scoring function with an equivalent sample size,[8] given in equation (2):

\[
\mathrm{score}(D, G) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \left( \log \frac{\Gamma\!\left(\frac{N}{q_i}\right)}{\Gamma\!\left(\frac{N}{q_i} + \sum_{k=1}^{r_i} M_{ijk}\right)} + \sum_{k=1}^{r_i} \log \frac{\Gamma(N_{ijk} + M_{ijk})}{\Gamma(N_{ijk})} \right) \qquad (2)
\]

where:
- n is the number of random variables (attributes) in the data D,
- M_{ijk} is the number of cases for which X_i takes on its k-th value while the parents of X_i in G are in their j-th instantiation,
- N_{ijk} is the hyperparameter, N_{ijk} = N / (r_i q_i),
- N is the equivalent sample size,
- r_i is the number of distinct values X_i takes on,
- q_i is the number of distinct instantiations of the parents of X_i.

Generally, obtaining the value of the scoring function requires many calculations on the data (e.g. the Gamma function has to be evaluated from the general formula Γ(x) = ∫_0^∞ t^{x-1} e^{-t} dt, since the arguments are not integers). In the case of the RandomizedLocalDAGSearch algorithm, the decomposability of the scoring function can be exploited, since the scored graphs are neighbors of the current graph, which differ from it by only one edge. In this way, the insertion or deletion of an arc X_j → X_i in a DAG G can be evaluated by computing only one new local score, score(D, X_i, PA(X_i) ∪ {X_j}) or score(D, X_i, PA(X_i) \ {X_j}) respectively; the reversal of an arc X_j → X_i requires the evaluation of two new local scores, score(D, X_i, PA(X_i) \ {X_j}) and score(D, X_j, PA(X_j) ∪ {X_i}). To improve efficiency, the data set is first pre-processed to build an index based on an associative container.

The time complexity of the RandomizedLocalDAGSearch algorithm is proportional to the number of local search iterations. Each local search comprises searches through neighborhoods of DAGs. The size of the neighborhood of a DAG G with n nodes is of order n^2, and every DAG in it is scored using the scoring function, whose cost is constant thanks to the decomposability property and the data indexing. If h is the average number of neighborhoods searched during one local search iteration, the time complexity is O(l h n^2). Depending on the fitness of the data to the initial graph, h can either be a small number, when a local search terminates quickly due to the lack of any better graph in the neighborhood of the current graph G, or a considerably big number, when the chain of consecutive better graphs is long.

When learning the structure of a Bayesian network from raw data, relying only on purely statistical characteristics of the data, the output graph should be converted to a PDAG. In this work the PDAG transformation algorithm given in [8] was implemented.

3. VERIFICATION ON SYNTHETIC AND REAL DATA

Functional testing of the software was split into two parts. The first part considered tests of the ability to reconstruct a given graph topology from data sampled from that graph; the second part referred to real SNP data detected by a DNA microarray. For validation of the structure learning algorithm, two metrics measuring the similarity of Bayesian network structures were defined.

Skeleton metric:

\[
M_s(G_1, G_2) = \frac{|E_1 \cap E_2|}{\max(|E_1|, |E_2|)} \qquad (3)
\]

where E_1 is the set of edges of G_1 with their directions disregarded and E_2 is the set of edges of G_2 with their directions disregarded.

V-structure metric:

\[
M_v(G_1, G_2) = \frac{|V_1 \cap V_2|}{\max(|V_1|, |V_2|)} \qquad (4)
\]

where V_1 is the set of v-structures of graph G_1 and V_2 is the set of v-structures of graph G_2. Elements of the sets V_1 and V_2 are defined as ordered triples of nodes (X_i, X_k, X_j), where the node in the middle is the middle node of a v-structure X_i → X_k ← X_j. Note that the higher the value of the two metrics, the more similar the two graphs are. The maximum value of each metric is 1. If M_s(G_1, G_2) = 1 and M_v(G_1, G_2) = 1, then G_1 and G_2 are Markov equivalent, hence they belong to the same Markov equivalence class.

3.1 Tests on synthetic data

The tests on synthetic data checked the ability to reconstruct a given graph topology from data sampled from that graph. A number of experiments were performed, in each of which an input Bayesian network of n vertices was specified and data of size m was sampled from this network. Next, this data was used for reconstruction of the graph. Taking into account the stochastic nature of forward sampling, t independent data sets were sampled and graph reconstruction was performed for each of these t data sets separately. To estimate the algorithm's accuracy in reconstructing the input network, the skeleton and v-structure metrics were calculated for each of the obtained graphs with respect to the input graph. The testing procedure is given in Listing 2.
Listing 2. Testing procedure.
Input: Bayesian network bn, data size m, number of data sets t.
Output: statistics of the skeleton and v-structure metrics for t samplings.

TestingProcedure(bn, m, t)
  for i ← 1 to t do
    sample data D_i of size m from bn
    build graph G_i from D_i using the algorithm from Listing 1
    compare G_i with the topology of bn using the skeleton and v-structure metrics
  end for
  calculate statistics of M_s^i and M_v^i

The number of samplings t done for each input Bayesian network was introduced to eliminate the influence of biased data on the structure learning algorithm's accuracy. The bigger t is, the more time it takes to obtain the results; therefore t was set so that the results could be obtained in a reasonable time.

The accuracy of graph reconstruction was then expressed using statistics of the calculated metrics M_s^i and M_v^i. The average reconstruction accuracy of the algorithm was characterized by

\[
M_s^{mean} = \frac{1}{t} \sum_{i=1}^{t} M_s^i \qquad \text{and} \qquad M_v^{mean} = \frac{1}{t} \sum_{i=1}^{t} M_v^i.
\]

First, binary networks of two and three nodes were tested to check whether the Bayesian-Dirichlet scoring function could properly differentiate among basic structures and whether the proposed algorithm converges to the best graph. The structures tested were: independent nodes, chains of nodes and v-structures. The CPTs in the graphs were specified so that they reflect the dependencies imposed by the graphs' structures. Next, to check whether ternary data can be recreated as well as binary data, two networks of three ternary nodes were proposed. Then, networks of four, five and six nodes were composed of the basic structures given above and the algorithm was tested on data sampled from these networks. The CPTs of the basic structures were not changed, or only slightly changed, so that according to the Markov condition a node would be independent of its non-descendants given its parents.

Tests were also performed for various data sizes. The general approach was to run the testing procedure starting from some initial data size m_init, increment the data size by some m_inc, run the testing procedure for m = m + m_inc and continue the process until m reaches some data cap m_cap, for which the proposed statistics saturate.

To illustrate the testing procedure, the result for the six-node network depicted in Figure 2 is described below. The network consists of one v-structure X_2 → X_3 ← X_4, two pairs of dependent nodes X_3 → X_5 and X_3 → X_6, and one independent node X_1; the CPTs were specified to the values shown in Figure 2.

Figure 2. Six-node network and its CPTs.

Figure 3. Six-node network results: M_s^mean (left) and M_v^mean (right) as a function of the data size.

The results in Figure 3 indicate a good average reconstruction ability.

The tests showed that when the CPTs of an input Bayesian network reflect the dependencies given by its structure, it was possible to reconstruct the network from sampled data. The reconstruction rate for basic structures (for greater data sizes) ranged from 90% to 95%, and for composed, more complex networks it ranged from 75% to 95%, both for the skeleton statistics and for the v-structure statistics. During the tests it was shown that the accuracy of structure reconstruction depends on the data size. The tendency is that the bigger the data size, the more the output graph resembles the input graph. However, for each Bayesian network there seems to be some threshold data size above which no increase of the algorithm's accuracy is observed. Oscillations appearing when the data size reaches the threshold value are possibly caused by the forward sampling procedure: sometimes the generated data set is better matched by a different network, and hence the input Bayesian network is not reconstructed accurately.

Apart from the tests described above, tests on random Bayesian networks, whose structure and CPTs were randomly generated, were performed. The generation procedure of a Bayesian network consisted of two steps: first generating the topology and then generating the entries of the CPTs. The two steps were independent of each other, which posed a problem, since it did not guarantee that the CPTs would reflect the dependencies entailed by the structure. Nevertheless, when the CPTs were properly specified, the reconstruction was possible.

3.2 Analysis of SNP data

The goal of testing the structure learning algorithm on real data was to check how it handled real joint probability distributions and to propose an application of the algorithm to real-world problems. The task was to reconstruct a graph of affinity relationships between members of a family based on single nucleotide polymorphism data.
The data was obtained from a genotyping experiment conducted for an 8-person family, for which almost ten thousand biallelic SNP loci were observed. The observed values were AA, AB, BB and NoCall. The values AA, AB and BB corresponded to alleles and were mapped to 1, 2 and 3 respectively; NoCall indicated a missing or uncertain value and was eliminated. This rendered a ternary data set of 8 rows (one per family member) by the number of retained loci. The graph of affinity relationships, given in Figure 4, was known before running the structure learning algorithm. The edges in the graph identify parent-child relationships.

Figure 4. Input graph of affinity relationships (pedigree) over the nodes X_1, ..., X_8.

The structure learning algorithm was run on the SNP data, performing multiple independent local searches, every time starting from a random DAG. Figure 5 shows the two graphs: the PDAG of the input graph and the PDAG of the output graph found by the algorithm. There are 7 edges in the input graph and 14 edges in the output graph. All the edges of the input graph are contained in the output graph, and thus the skeleton metric of the two graphs is 7/14 = 0.5. There is one v-structure in the input graph, namely X_3 → X_5 ← X_8, and two v-structures in the output graph: X_3 → X_5 ← X_8, which was successfully reconstructed, and X_2 → X_8 ← X_6; the v-structure metric of the two graphs is also 0.5. The output graph additionally contains edges not present in the input graph. Some of the additional edges indicate sibling relationships between the family members. From Figure 4 we see that there are three siblings in the first generation, X_2, X_3 and X_4, and two siblings in the second generation, X_6 and X_7. All these siblings were discovered by the structure learning algorithm. There is also an edge between X_1 and X_6 which expresses a cross-generation relationship.

Figure 5. Affinity relationship PDAG vs. discovered PDAG (nodes X_1, ..., X_8).

The only unexpected structure present in the output graph is the v-structure X_2 → X_8 ← X_6. X_8 is not tied to the family in terms of gene inheritance, since this individual comes from outside the family, and yet the two nodes X_2 and X_6 point towards X_8 as if they were parents of X_8. This result leads to the conclusion that the discovered edges, apart from affinity relationships, might model some other unknown, yet statistically important, feature.

3.3 Software development

Most of the software that accompanied the project constitutes a C++ library dedicated to programmers. Additionally, unit tests were developed for the purpose of verification and validation of the library's modules. The project also includes a command line application, implemented to deliver most of the functionality provided by the library without forcing the user to deal with programming issues. The software was developed in C++ using standard libraries as well as portable third-party libraries. Testing of the software was performed on Windows (Visual Studio) and Linux (GNU Compiler Collection) platforms.

The library consists of four modules:
- sampling module
- structure learning module
- validation module
- serialization module

Each of the modules is based on the dependency injection design pattern,[10] which removes direct dependencies between components and favors a plug-in architecture.

4. CONCLUSIONS

This article presented an algorithm for the discovery of Bayesian network structure as well as its application to the analysis of genetic data. Prior to the analysis of SNP data, the structure learning algorithm was verified on synthetic data. The tests indicated that the algorithm was able to find the best Bayesian network for a given data set as long as the size of the data was sufficient. Testing the algorithm on real data accounted for the analysis of almost ten thousand SNPs taken from an eight-person family.
Apart from familial relationships, the analyzed SNPs turned out to manifest other dependencies as well, as the algorithm pointed out one v-structure which did not fit the affinity relationships. Further studies on the data could focus on finding a subset of SNPs which solely determine the affinity relationships, and a subset of those SNPs that cause the unexplained v-structure to appear. The main advantage of the presented algorithm lies in the reduction of the structure learning complexity, owing to the local search heuristic for searching the space of DAGs and to the decomposability of the scoring function. The library implemented in C++, as well as the other tools, is freely available for academic and commercial purposes.

Library name: mugraph
Precompiled binaries: Windows XP/Vista, Debian Linux
Programming language: C++
License: GNU LGPL

REFERENCES

[1] Pearl, J., [Causality: Models, Reasoning, and Inference], Cambridge University Press (2000).
[2] Madigan, D., Andersson, S., Perlman, M., and Volinsky, C., "Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs," Communications in Statistics - Theory and Methods 25 (1996).
[3] Thorisson, G., Smith, A., Krishnan, L., and Stein, L., "The International HapMap Project web site," (2005).
[4] Lauritzen, S. and Sheehan, N., "Graphical models for genetic analyses," Statistical Science (2003).
[5] Cottingham Jr., R., Idury, R., and Schäffer, A., "Faster sequential genetic linkage computations," American Journal of Human Genetics 53, 252 (1993).
[6] O'Connell, J. and Weeks, D., "The VITESSE algorithm for rapid exact multilocus linkage analysis via genotype set recoding and fuzzy inheritance," Nature Genetics (1995).
[7] Bishop, C., [Pattern Recognition and Machine Learning], Springer, New York (2006).
[8] Neapolitan, R., [Learning Bayesian Networks], Prentice Hall, Upper Saddle River, NJ (2003).
[9] Cooper, G. and Herskovits, E., "A Bayesian method for the induction of probabilistic networks from data," Machine Learning 9(4) (1992).
[10] Fowler, M., "Inversion of control containers and the dependency injection pattern," (2004).


More information

Machine Learning. Sourangshu Bhattacharya

Machine Learning. Sourangshu Bhattacharya Machine Learning Sourangshu Bhattacharya Bayesian Networks Directed Acyclic Graph (DAG) Bayesian Networks General Factorization Curve Fitting Re-visited Maximum Likelihood Determine by minimizing sum-of-squares

More information

Population Genetics (52642)

Population Genetics (52642) Population Genetics (52642) Benny Yakir 1 Introduction In this course we will examine several topics that are related to population genetics. In each topic we will discuss briefly the biological background

More information

Lectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last years lectures

Lectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last years lectures 4 FastA and the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem 4.1 Sources for this lecture Lectures by Volker Heun, Daniel Huson and Knut

More information

Workshop report 1. Daniels report is on website 2. Don t expect to write it based on listening to one project (we had 6 only 2 was sufficient

Workshop report 1. Daniels report is on website 2. Don t expect to write it based on listening to one project (we had 6 only 2 was sufficient Workshop report 1. Daniels report is on website 2. Don t expect to write it based on listening to one project (we had 6 only 2 was sufficient quality) 3. I suggest writing it on one presentation. 4. Include

More information

A New Approach For Convert Multiply-Connected Trees in Bayesian networks

A New Approach For Convert Multiply-Connected Trees in Bayesian networks A New Approach For Convert Multiply-Connected Trees in Bayesian networks 1 Hussein Baloochian, Alireza khantimoory, 2 Saeed Balochian 1 Islamic Azad university branch of zanjan 2 Islamic Azad university

More information

Applications of admixture models

Applications of admixture models Applications of admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price Applications of admixture models 1 / 27

More information

Learning Equivalence Classes of Bayesian-Network Structures

Learning Equivalence Classes of Bayesian-Network Structures Journal of Machine Learning Research 2 (2002) 445-498 Submitted 7/01; Published 2/02 Learning Equivalence Classes of Bayesian-Network Structures David Maxwell Chickering Microsoft Research One Microsoft

More information

Chapter S:II. II. Search Space Representation

Chapter S:II. II. Search Space Representation Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation

More information

A note on the pairwise Markov condition in directed Markov fields

A note on the pairwise Markov condition in directed Markov fields TECHNICAL REPORT R-392 April 2012 A note on the pairwise Markov condition in directed Markov fields Judea Pearl University of California, Los Angeles Computer Science Department Los Angeles, CA, 90095-1596,

More information

Summary: A Tutorial on Learning With Bayesian Networks

Summary: A Tutorial on Learning With Bayesian Networks Summary: A Tutorial on Learning With Bayesian Networks Markus Kalisch May 5, 2006 We primarily summarize [4]. When we think that it is appropriate, we comment on additional facts and more recent developments.

More information

Meta- Heuristic based Optimization Algorithms: A Comparative Study of Genetic Algorithm and Particle Swarm Optimization

Meta- Heuristic based Optimization Algorithms: A Comparative Study of Genetic Algorithm and Particle Swarm Optimization 2017 2 nd International Electrical Engineering Conference (IEEC 2017) May. 19 th -20 th, 2017 at IEP Centre, Karachi, Pakistan Meta- Heuristic based Optimization Algorithms: A Comparative Study of Genetic

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the

More information

USING A PRIORI KNOWLEDGE TO CREATE PROBABILISTIC MODELS FOR OPTIMIZATION

USING A PRIORI KNOWLEDGE TO CREATE PROBABILISTIC MODELS FOR OPTIMIZATION USING A PRIORI KNOWLEDGE TO CREATE PROBABILISTIC MODELS FOR OPTIMIZATION Shumeet Baluja 1 School of Computer Science Carnegie Mellon University Abstract Recent studies have examined the effectiveness of

More information

A probabilistic logic incorporating posteriors of hierarchic graphical models

A probabilistic logic incorporating posteriors of hierarchic graphical models A probabilistic logic incorporating posteriors of hierarchic graphical models András s Millinghoffer, Gábor G Hullám and Péter P Antal Department of Measurement and Information Systems Budapest University

More information

GRASP. Greedy Randomized Adaptive. Search Procedure

GRASP. Greedy Randomized Adaptive. Search Procedure GRASP Greedy Randomized Adaptive Search Procedure Type of problems Combinatorial optimization problem: Finite ensemble E = {1,2,... n } Subset of feasible solutions F 2 Objective function f : 2 Minimisation

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 5 Inference

More information

Directed Graphical Models (Bayes Nets) (9/4/13)

Directed Graphical Models (Bayes Nets) (9/4/13) STA561: Probabilistic machine learning Directed Graphical Models (Bayes Nets) (9/4/13) Lecturer: Barbara Engelhardt Scribes: Richard (Fangjian) Guo, Yan Chen, Siyang Wang, Huayang Cui 1 Introduction For

More information

A Fast Algorithm for Optimal Alignment between Similar Ordered Trees

A Fast Algorithm for Optimal Alignment between Similar Ordered Trees Fundamenta Informaticae 56 (2003) 105 120 105 IOS Press A Fast Algorithm for Optimal Alignment between Similar Ordered Trees Jesper Jansson Department of Computer Science Lund University, Box 118 SE-221

More information

Graphical Models. David M. Blei Columbia University. September 17, 2014

Graphical Models. David M. Blei Columbia University. September 17, 2014 Graphical Models David M. Blei Columbia University September 17, 2014 These lecture notes follow the ideas in Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. In addition,

More information

Using a genetic algorithm for editing k-nearest neighbor classifiers

Using a genetic algorithm for editing k-nearest neighbor classifiers Using a genetic algorithm for editing k-nearest neighbor classifiers R. Gil-Pita 1 and X. Yao 23 1 Teoría de la Señal y Comunicaciones, Universidad de Alcalá, Madrid (SPAIN) 2 Computer Sciences Department,

More information

Literature Review On Implementing Binary Knapsack problem

Literature Review On Implementing Binary Knapsack problem Literature Review On Implementing Binary Knapsack problem Ms. Niyati Raj, Prof. Jahnavi Vitthalpura PG student Department of Information Technology, L.D. College of Engineering, Ahmedabad, India Assistant

More information

Genetic Analysis. Page 1

Genetic Analysis. Page 1 Genetic Analysis Page 1 Genetic Analysis Objectives: 1) Set up Case-Control Association analysis and the Basic Genetics Workflow 2) Use JMP tools to interact with and explore results 3) Learn advanced

More information

Missing Data Estimation in Microarrays Using Multi-Organism Approach

Missing Data Estimation in Microarrays Using Multi-Organism Approach Missing Data Estimation in Microarrays Using Multi-Organism Approach Marcel Nassar and Hady Zeineddine Progress Report: Data Mining Course Project, Spring 2008 Prof. Inderjit S. Dhillon April 02, 2008

More information

A Parallel Evolutionary Algorithm for Discovery of Decision Rules

A Parallel Evolutionary Algorithm for Discovery of Decision Rules A Parallel Evolutionary Algorithm for Discovery of Decision Rules Wojciech Kwedlo Faculty of Computer Science Technical University of Bia lystok Wiejska 45a, 15-351 Bia lystok, Poland wkwedlo@ii.pb.bialystok.pl

More information

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT IADIS International Conference Applied Computing 2006 USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT Divya R. Singh Software Engineer Microsoft Corporation, Redmond, WA 98052, USA Abdullah

More information

Suppose you have a problem You don t know how to solve it What can you do? Can you use a computer to somehow find a solution for you?

Suppose you have a problem You don t know how to solve it What can you do? Can you use a computer to somehow find a solution for you? Gurjit Randhawa Suppose you have a problem You don t know how to solve it What can you do? Can you use a computer to somehow find a solution for you? This would be nice! Can it be done? A blind generate

More information

A Transformational Characterization of Markov Equivalence for Directed Maximal Ancestral Graphs

A Transformational Characterization of Markov Equivalence for Directed Maximal Ancestral Graphs A Transformational Characterization of Markov Equivalence for Directed Maximal Ancestral Graphs Jiji Zhang Philosophy Department Carnegie Mellon University Pittsburgh, PA 15213 jiji@andrew.cmu.edu Abstract

More information

8/19/13. Computational problems. Introduction to Algorithm

8/19/13. Computational problems. Introduction to Algorithm I519, Introduction to Introduction to Algorithm Yuzhen Ye (yye@indiana.edu) School of Informatics and Computing, IUB Computational problems A computational problem specifies an input-output relationship

More information

Graph and Digraph Glossary

Graph and Digraph Glossary 1 of 15 31.1.2004 14:45 Graph and Digraph Glossary A B C D E F G H I-J K L M N O P-Q R S T U V W-Z Acyclic Graph A graph is acyclic if it contains no cycles. Adjacency Matrix A 0-1 square matrix whose

More information

Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion

Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion Journal of Machine Learning Research 14 (2013) 1563-1603 Submitted 11/10; Revised 9/12; Published 6/13 Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion Rami Mahdi

More information

1. Why Study Trees? Trees and Graphs. 2. Binary Trees. CITS2200 Data Structures and Algorithms. Wood... Topic 10. Trees are ubiquitous. Examples...

1. Why Study Trees? Trees and Graphs. 2. Binary Trees. CITS2200 Data Structures and Algorithms. Wood... Topic 10. Trees are ubiquitous. Examples... . Why Study Trees? CITS00 Data Structures and Algorithms Topic 0 Trees and Graphs Trees and Graphs Binary trees definitions: size, height, levels, skinny, complete Trees, forests and orchards Wood... Examples...

More information

A genetic algorithm for kidney transplantation matching

A genetic algorithm for kidney transplantation matching A genetic algorithm for kidney transplantation matching S. Goezinne Research Paper Business Analytics Supervisors: R. Bekker and K. Glorie March 2016 VU Amsterdam Faculty of Exact Sciences De Boelelaan

More information

An Information Theory based Approach to Structure Learning in Bayesian Networks

An Information Theory based Approach to Structure Learning in Bayesian Networks An Information Theory based Approach to Structure Learning in Bayesian Networks Gopalakrishna Anantha 9 th October 2006 Committee Dr.Xue wen Chen (Chair) Dr. John Gauch Dr. Victor Frost Publications An

More information

The Acyclic Bayesian Net Generator (Student Paper)

The Acyclic Bayesian Net Generator (Student Paper) The Acyclic Bayesian Net Generator (Student Paper) Pankaj B. Gupta and Vicki H. Allan Microsoft Corporation, One Microsoft Way, Redmond, WA 98, USA, pagupta@microsoft.com Computer Science Department, Utah

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Overview of Part One Probabilistic Graphical Models Part One: Graphs and Markov Properties Christopher M. Bishop Graphs and probabilities Directed graphs Markov properties Undirected graphs Examples Microsoft

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Graph Algorithms Using Depth First Search

Graph Algorithms Using Depth First Search Graph Algorithms Using Depth First Search Analysis of Algorithms Week 8, Lecture 1 Prepared by John Reif, Ph.D. Distinguished Professor of Computer Science Duke University Graph Algorithms Using Depth

More information

Step-by-Step Guide to Advanced Genetic Analysis

Step-by-Step Guide to Advanced Genetic Analysis Step-by-Step Guide to Advanced Genetic Analysis Page 1 Introduction In the previous document, 1 we covered the standard genetic analyses available in JMP Genomics. Here, we cover the more advanced options

More information

Graphical Models. Pradeep Ravikumar Department of Computer Science The University of Texas at Austin

Graphical Models. Pradeep Ravikumar Department of Computer Science The University of Texas at Austin Graphical Models Pradeep Ravikumar Department of Computer Science The University of Texas at Austin Useful References Graphical models, exponential families, and variational inference. M. J. Wainwright

More information

Smoothing Dissimilarities for Cluster Analysis: Binary Data and Functional Data

Smoothing Dissimilarities for Cluster Analysis: Binary Data and Functional Data Smoothing Dissimilarities for Cluster Analysis: Binary Data and unctional Data David B. University of South Carolina Department of Statistics Joint work with Zhimin Chen University of South Carolina Current

More information

Evaluating the Explanatory Value of Bayesian Network Structure Learning Algorithms

Evaluating the Explanatory Value of Bayesian Network Structure Learning Algorithms Evaluating the Explanatory Value of Bayesian Network Structure Learning Algorithms Patrick Shaughnessy University of Massachusetts, Lowell pshaughn@cs.uml.edu Gary Livingston University of Massachusetts,

More information

Chapter 2 PRELIMINARIES. 1. Random variables and conditional independence

Chapter 2 PRELIMINARIES. 1. Random variables and conditional independence Chapter 2 PRELIMINARIES In this chapter the notation is presented and the basic concepts related to the Bayesian network formalism are treated. Towards the end of the chapter, we introduce the Bayesian

More information

Graphical Models Part 1-2 (Reading Notes)

Graphical Models Part 1-2 (Reading Notes) Graphical Models Part 1-2 (Reading Notes) Wednesday, August 3 2011, 2:35 PM Notes for the Reading of Chapter 8 Graphical Models of the book Pattern Recognition and Machine Learning (PRML) by Chris Bishop

More information

Bayesian Machine Learning - Lecture 6

Bayesian Machine Learning - Lecture 6 Bayesian Machine Learning - Lecture 6 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 2, 2015 Today s lecture 1

More information

Graphical Models and Bayesian Networks user!2015 tutorial

Graphical Models and Bayesian Networks user!2015 tutorial 1/53 Graphical Models and Bayesian Networks user!2015 tutorial Therese Graversen Department of Mathematical Sciences, University of Copenhagen 30 June 2015 Bayesian Networks 2/53 3/53 Models for discrete

More information

Neural Network Weight Selection Using Genetic Algorithms

Neural Network Weight Selection Using Genetic Algorithms Neural Network Weight Selection Using Genetic Algorithms David Montana presented by: Carl Fink, Hongyi Chen, Jack Cheng, Xinglong Li, Bruce Lin, Chongjie Zhang April 12, 2005 1 Neural Networks Neural networks

More information

Av. Prof. Mello Moraes, 2231, , São Paulo, SP - Brazil

Av. Prof. Mello Moraes, 2231, , São Paulo, SP - Brazil " Generalizing Variable Elimination in Bayesian Networks FABIO GAGLIARDI COZMAN Escola Politécnica, University of São Paulo Av Prof Mello Moraes, 31, 05508-900, São Paulo, SP - Brazil fgcozman@uspbr Abstract

More information

Data Mining Chapter 8: Search and Optimization Methods Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 8: Search and Optimization Methods Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 8: Search and Optimization Methods Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Search & Optimization Search and Optimization method deals with

More information

A Genetic Algorithm Applied to Graph Problems Involving Subsets of Vertices

A Genetic Algorithm Applied to Graph Problems Involving Subsets of Vertices A Genetic Algorithm Applied to Graph Problems Involving Subsets of Vertices Yaser Alkhalifah Roger L. Wainwright Department of Mathematical Department of Mathematical and Computer Sciences and Computer

More information

The max-min hill-climbing Bayesian network structure learning algorithm

The max-min hill-climbing Bayesian network structure learning algorithm Mach Learn (2006) 65:31 78 DOI 10.1007/s10994-006-6889-7 The max-min hill-climbing Bayesian network structure learning algorithm Ioannis Tsamardinos Laura E. Brown Constantin F. Aliferis Received: January

More information

Exploration vs. Exploitation in Differential Evolution

Exploration vs. Exploitation in Differential Evolution Exploration vs. Exploitation in Differential Evolution Ângela A. R. Sá 1, Adriano O. Andrade 1, Alcimar B. Soares 1 and Slawomir J. Nasuto 2 Abstract. Differential Evolution (DE) is a tool for efficient

More information

PACKING DIGRAPHS WITH DIRECTED CLOSED TRAILS

PACKING DIGRAPHS WITH DIRECTED CLOSED TRAILS PACKING DIGRAPHS WITH DIRECTED CLOSED TRAILS PAUL BALISTER Abstract It has been shown [Balister, 2001] that if n is odd and m 1,, m t are integers with m i 3 and t i=1 m i = E(K n) then K n can be decomposed

More information

HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT

HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT - Swarbhanu Chatterjee. Hidden Markov models are a sophisticated and flexible statistical tool for the study of protein models. Using HMMs to analyze proteins

More information

Slides for Faculty Oxford University Press All rights reserved.

Slides for Faculty Oxford University Press All rights reserved. Oxford University Press 2013 Slides for Faculty Assistance Preliminaries Author: Vivek Kulkarni vivek_kulkarni@yahoo.com Outline Following topics are covered in the slides: Basic concepts, namely, symbols,

More information

Towards a Weighted-Tree Similarity Algorithm for RNA Secondary Structure Comparison

Towards a Weighted-Tree Similarity Algorithm for RNA Secondary Structure Comparison Towards a Weighted-Tree Similarity Algorithm for RNA Secondary Structure Comparison Jing Jin, Biplab K. Sarker, Virendra C. Bhavsar, Harold Boley 2, Lu Yang Faculty of Computer Science, University of New

More information

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic programming is a group of mathematical methods used to sequentially split a complicated problem into

More information

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be 48 Bioinformatics I, WS 09-10, S. Henz (script by D. Huson) November 26, 2009 4 BLAST and BLAT Outline of the chapter: 1. Heuristics for the pairwise local alignment of two sequences 2. BLAST: search and

More information

A Probabilistic Relaxation Framework for Learning Bayesian Network Structures from Data

A Probabilistic Relaxation Framework for Learning Bayesian Network Structures from Data A Probabilistic Relaxation Framework for Learning Bayesian Network Structures from Data by Ahmed Mohammed Hassan A Thesis Submitted to the Faculty of Engineering at Cairo University In Partial Fulfillment

More information

Genetic Programming. Charles Chilaka. Department of Computational Science Memorial University of Newfoundland

Genetic Programming. Charles Chilaka. Department of Computational Science Memorial University of Newfoundland Genetic Programming Charles Chilaka Department of Computational Science Memorial University of Newfoundland Class Project for Bio 4241 March 27, 2014 Charles Chilaka (MUN) Genetic algorithms and programming

More information

Causality in Communication: The Agent-Encapsulated Bayesian Network Model

Causality in Communication: The Agent-Encapsulated Bayesian Network Model Causality in Communication: The Agent-Encapsulated Bayesian Network Model Scott Langevin Oculus Info, Inc. Toronto, Ontario, Canada Marco Valtorta mgv@cse.sc.edu University of South Carolina, Columbia,

More information

1 : Introduction to GM and Directed GMs: Bayesian Networks. 3 Multivariate Distributions and Graphical Models

1 : Introduction to GM and Directed GMs: Bayesian Networks. 3 Multivariate Distributions and Graphical Models 10-708: Probabilistic Graphical Models, Spring 2015 1 : Introduction to GM and Directed GMs: Bayesian Networks Lecturer: Eric P. Xing Scribes: Wenbo Liu, Venkata Krishna Pillutla 1 Overview This lecture

More information

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models Prof. Daniel Cremers 4. Probabilistic Graphical Models Directed Models The Bayes Filter (Rep.) (Bayes) (Markov) (Tot. prob.) (Markov) (Markov) 2 Graphical Representation (Rep.) We can describe the overall

More information

Machine Learning!!!!! Srihari. Chain Graph Models. Sargur Srihari

Machine Learning!!!!! Srihari. Chain Graph Models. Sargur Srihari Chain Graph Models Sargur Srihari srihari@cedar.buffalo.edu 1 Topics PDAGs or Chain Graphs Generalization of CRF to Chain Graphs Independencies in Chain Graphs Summary BN versus MN 2 Partially Directed

More information

Integrating locally learned causal structures with overlapping variables

Integrating locally learned causal structures with overlapping variables Integrating locally learned causal structures with overlapping variables Robert E. Tillman Carnegie Mellon University Pittsburgh, PA rtillman@andrew.cmu.edu David Danks, Clark Glymour Carnegie Mellon University

More information

Haplotype Analysis. 02 November 2003 Mendel Short IGES Slide 1

Haplotype Analysis. 02 November 2003 Mendel Short IGES Slide 1 Haplotype Analysis Specifies the genetic information descending through a pedigree Useful visualization of the gene flow through a pedigree A haplotype for a given individual and set of loci is defined

More information

A Comparative Study of Linear Encoding in Genetic Programming

A Comparative Study of Linear Encoding in Genetic Programming 2011 Ninth International Conference on ICT and Knowledge A Comparative Study of Linear Encoding in Genetic Programming Yuttana Suttasupa, Suppat Rungraungsilp, Suwat Pinyopan, Pravit Wungchusunti, Prabhas

More information

Part II. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

Part II. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Part II C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Converting Directed to Undirected Graphs (1) Converting Directed to Undirected Graphs (2) Add extra links between

More information

Fortran 90 Two Commonly Used Statements

Fortran 90 Two Commonly Used Statements Fortran 90 Two Commonly Used Statements 1. DO Loops (Compiled primarily from Hahn [1994]) Lab 6B BSYSE 512 Research and Teaching Methods The DO loop (or its equivalent) is one of the most powerful statements

More information

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering Sequence clustering Introduction Data clustering is one of the key tools used in various incarnations of data-mining - trying to make sense of large datasets. It is, thus, natural to ask whether clustering

More information

Escola Politécnica, University of São Paulo Av. Prof. Mello Moraes, 2231, , São Paulo, SP - Brazil

Escola Politécnica, University of São Paulo Av. Prof. Mello Moraes, 2231, , São Paulo, SP - Brazil Generalizing Variable Elimination in Bayesian Networks FABIO GAGLIARDI COZMAN Escola Politécnica, University of São Paulo Av. Prof. Mello Moraes, 2231, 05508-900, São Paulo, SP - Brazil fgcozman@usp.br

More information

Factorization with Missing and Noisy Data

Factorization with Missing and Noisy Data Factorization with Missing and Noisy Data Carme Julià, Angel Sappa, Felipe Lumbreras, Joan Serrat, and Antonio López Computer Vision Center and Computer Science Department, Universitat Autònoma de Barcelona,

More information

Long Read RNA-seq Mapper

Long Read RNA-seq Mapper UNIVERSITY OF ZAGREB FACULTY OF ELECTRICAL ENGENEERING AND COMPUTING MASTER THESIS no. 1005 Long Read RNA-seq Mapper Josip Marić Zagreb, February 2015. Table of Contents 1. Introduction... 1 2. RNA Sequencing...

More information

Introduction to Graphical Models

Introduction to Graphical Models Robert Collins CSE586 Introduction to Graphical Models Readings in Prince textbook: Chapters 10 and 11 but mainly only on directed graphs at this time Credits: Several slides are from: Review: Probability

More information

MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS

MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS Mathematical and Computational Applications, Vol. 5, No. 2, pp. 240-247, 200. Association for Scientific Research MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS Volkan Uslan and Đhsan Ömür Bucak

More information

Breeding Guide. Customer Services PHENOME-NETWORKS 4Ben Gurion Street, 74032, Nes-Ziona, Israel

Breeding Guide. Customer Services PHENOME-NETWORKS 4Ben Gurion Street, 74032, Nes-Ziona, Israel Breeding Guide Customer Services PHENOME-NETWORKS 4Ben Gurion Street, 74032, Nes-Ziona, Israel www.phenome-netwoks.com Contents PHENOME ONE - INTRODUCTION... 3 THE PHENOME ONE LAYOUT... 4 THE JOBS ICON...

More information

The Basics of Graphical Models

The Basics of Graphical Models The Basics of Graphical Models David M. Blei Columbia University September 30, 2016 1 Introduction (These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan.

More information