Simulation Study on the Methods for Mapping Quantitative Trait Loci in Inbred Line Crosses


Simulation Study on the Methods for Mapping Quantitative Trait Loci in Inbred Line Crosses

A Dissertation Submitted in Partial Fulfilment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY in Genetics and Plant Breeding

by SHENGCHU WANG

Zhejiang University, Hangzhou, Zhejiang, China, 2000

A Ph.D. DISSERTATION

Simulation Study of the Methods for Mapping Quantitative Trait Loci in Inbred Line Crosses

By Shengchu Wang
Major: Genetics and Plant Breeding
Supervisors: Dr. Jun Zhu and Dr. Zhao-Bang Zeng
Zhejiang University, Hangzhou, Zhejiang, China, 2000

DEDICATION

To My Wife, Xiu-Juan Rong, and Daughter, Min-Xue Wang

Acknowledgments

I would like to express my special thanks to my advisor, Dr. Jun Zhu, for his important direction, encouragement and support during my doctoral study and dissertation research. The experience of studying with Dr. Zhu was beneficial and unforgettable. I would like to express my sincere thanks to Dr. Zhao-Bang Zeng for supporting me financially to do part of my dissertation research in the US and for giving me a great deal of help in my research and my life while I stayed at NCSU. Thanks also to Dr. Bruce Weir for providing the host lab and for good advice on my research work. I would like to express my gratitude to my wife and daughter for their support and patience. I am grateful to Dr. Xin-Fu Yan, Dr. Yue-Fu Liu, Dr. Rong-Ling Wu, Hai-Ming Xu, Ci-Xin He, and everyone who helped me during my dissertation research. I also wish to thank my colleagues at the computer centre, Zhejiang University, for their support of my doctoral study and dissertation research.

Abstract

With the rapid advance of molecular genetics, it is now much easier to obtain well-distributed genetic markers in almost every organism. As a result, and as a major direction of quantitative genetics, various statistical methods have been developed to detect or map quantitative trait loci (QTLs) by using genetic marker information. In this dissertation, the principles and models of various QTL mapping methods are summarized. These methods include single marker analysis, interval mapping (IM), composite interval mapping (CIM), mixed-model-based composite interval mapping (MCIM), and multiple interval mapping (MIM). Large-scale simulation studies were used to explore and compare these methods. The simulations indicated that although single marker analysis can detect QTLs, it cannot locate their positions or estimate their effects. Simulations were also conducted to study and compare IM, CIM, and MCIM under the simple additive situation. By analysing the LR profile, the power of QTL detection and the probability of detecting false QTLs can be calculated for the three methods under various situations. Estimates of QTL effects and positions, together with their 95% experimental confidence intervals (ECI), were also obtained for the detected QTLs. The simulation results are useful to those who use these three methods in QTL mapping practice. They can serve as one basis for choosing among the available mapping methods for a particular experimental design, and they can help in interpreting QTL mapping results. In real QTL mapping experiments, however, more complicated situations such as QTL by environment (QE) interaction and QTL epistasis generally exist.
For the IM and CIM methods, the simulation studies implied that unbiased estimates of QTL main effects can be obtained by using the data from all environments together. However, it is difficult to estimate QE interaction effects, even by mapping QTLs separately on the data from each environment. The MCIM method can put all QTL main effects and QE interaction effects into one mixed linear model and, as the simulation study indicated, obtains unbiased estimates of both main and QE effects. The MCIM method can also use the mixed linear model to map QTLs with marginal and epistatic effects. The simulation study indicated that MCIM can obtain unbiased estimates of QTL marginal and epistatic effects at the same time. Although IM and CIM can give unbiased estimates of marginal QTL effects when QTL epistatic effects exist, the variance of those estimates increases considerably. In addition, the detection power for QTLs goes down and the probability of false QTL detection goes up markedly, especially for the CIM method.

MIM is a multiple-QTL oriented method that can also analyse QTL epistatic effects. However, one of the crucial parts of the MIM method is the criterion, or stopping rule, for model selection. We proposed a set of parameters for measuring the fit between the selected model and the real model, and we present an experimental criterion for model selection in the framework of QTL mapping, derived by simulation. The criterion is a modification of BIC that adds relevant factors such as heritability, marker density, sample size, and chromosome number. The experimental criterion works well in the simulated cases.

A modified version of the QTL Cartographer software has been developed, called Windows QTL Cartographer. Unlike the original QTL Cartographer, Windows QTL Cartographer has a user-friendly interface and powerful graphical presentation of the mapping results.
It has many users and has been posted on the Internet.

Key words: Computer simulation; QTL mapping methods; Quantitative trait loci; Model selection; BIC criterion

TABLE OF CONTENTS

1. INTRODUCTION
    HISTORY OF THE QTL MAPPING WORK
    MOLECULAR MARKERS
    EXPERIMENTAL DESIGN
    MODELS AND SOFTWARE
    SIMULATION VS. REAL DATA
    MAP FUNCTIONS AND MARKER ANALYSIS
        Map Functions
        Marker Order Analysis
        Marker Segregation Analysis
    PURPOSE OF THIS RESEARCH
2. REVIEW OF MAJOR QTL MAPPING METHODS
    ONE MARKER METHOD
        Statistic Bases for One Marker Method
        The t-test Method
        Likelihood Ratio Test Method
        Simple Regression Method
    INTERVAL MAPPING METHOD
        Conditional Probabilities of QTL Genotypes
        Genetic Model
        Maximum Likelihood Analysis
        Likelihood Ratio Test
    COMPOSITE INTERVAL MAPPING
        Properties of Multiple Regression Analysis
        Genetic Model
        Likelihood Analysis
        Hypothesis Test
        Marker Selection
    MIXED LINEAR MODEL APPROACH
        Genetic Model
        Likelihood Analysis
        Hypothesis Test
        A Model for GE Interaction
        A Model for QTL Epistasis
    MULTIPLE INTERVAL MAPPING
3. SIMULATION STUDIES
    SIMULATION MODEL AND DATA
        Genetic Model for Simulation
        Parameter Setting
        Simulation Procedure
        Format of the Simulation Data
    SINGLE MARKER ANALYSIS
    COMPARING DIFFERENT MAPPING METHOD
        Parameters Setting
        Estimation of QTL Effects
        Power and False Positive
        Positions and Effects of Detected QTLs
        The LR Profile
    CONSIDER THE COMPLICATED QTL MAPPING SITUATIONS
        Parameters Setting
        Performance of IM and CIM Methods
        Using MCIM Method
4. MODEL SELECTION AND CRITERIA
    MIM AND MODEL SELECTION
    MODEL EVALUATION STANDARD
    MODEL SELECTION STRATEGY AND CRITERIA
    PROCEDURE OF MODEL SELECTION
    SUMMARY OF CRITERIA FOR MODEL SELECTION
        Adjusted R^2
        Mallows' C_p (Mallows 1973)
        Mean Squared Error Prediction (Aitkin 1974, Miller 1990)
        BIC and Related Criteria
    SIMULATION STUDIES OF CRITERIA
        FW and BW Methods
        Criteria and the Various Parameters
        Experimental Criteria
5. CONCLUSIONS AND DISCUSSION
    CONCLUSION
    THRESHOLD AND CRITERIA
    SOFTWARE DESIGN
REFERENCE

1. Introduction

1-1 History of the QTL mapping work

The rediscovery of Mendelian genetics in 1900 is regarded as the beginning of modern genetics. Studies on the inheritance of discrete characters, such as purple vs. white flowers or smooth vs. wrinkled seeds, made it clear that traits are controlled by genetic factors, or genes, which are inherited from generation to generation. Later, great efforts were made to understand how genes affect discrete characters (qualitative traits), and especially how genes are transmitted from parents to their offspring. However, most economically and biologically important traits are not qualitative but quantitative in nature. Here quantitative means that the trait values cannot be divided into a few categories and are distributed continuously over a range in a population. Examples of quantitative traits are crop yield, plant height, resistance to diseases, weight gain in mice, and egg or milk production in animals. Because of the complex nature of quantitative inheritance, progress in quantitative genetics has been far behind that in Mendelian genetics.

The traditional way to study quantitative traits is to partition the phenotypic variance into genetic and non-genetic variance components:

$V_P = V_G + V_e = V_A + V_D + V_I + V_e$

Here the phenotypic variance $V_P$ is partitioned into two components: the genetic part $V_G$ and the environmental and residual part $V_e$. The genetic variance can be further partitioned into additive ($V_A$), dominance ($V_D$) and epistatic ($V_I$) variances. It is also possible to partition $V_G$ into other variance components according to the application. For example,

$V_G = V_A + V_D + V_L + V_M$

where $V_L$ is the sex-linkage component and $V_M$ is the maternal variance component (Zhu and Weir, 1996).
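As a small numerical illustration of this partition, the sketch below adds up a set of variance components and forms the two ratios that the text goes on to call heritability. The function name and the component values are hypothetical and chosen only for the example; they are not estimates from any data set.

```python
def variance_partition(v_a, v_d, v_i, v_e):
    """Combine variance components following V_P = V_A + V_D + V_I + V_e.

    Returns the genetic variance V_G, the phenotypic variance V_P,
    the ratio V_G / V_P and the ratio V_A / V_P.
    """
    v_g = v_a + v_d + v_i   # genetic variance
    v_p = v_g + v_e         # phenotypic variance
    return v_g, v_p, v_g / v_p, v_a / v_p


# Hypothetical component values, for illustration only.
v_g, v_p, ratio_g, ratio_a = variance_partition(v_a=4.0, v_d=1.0, v_i=1.0, v_e=4.0)
print(v_g, v_p, ratio_g, ratio_a)
```

With these made-up values the genetic variance is 6 out of a phenotypic variance of 10, so the two ratios are 0.6 and 0.4.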

These variance components can be estimated under special breeding designs (Cockerham 1961, Eberhart et al. 1966, Falconer 1996, Zhu 1998). The estimates allow us to evaluate the relative importance of the various determinants of the phenotypic variance. The ratio $V_G/V_P$ is called heritability in the broad sense, and $V_A/V_P$ is called heritability in the narrow sense, or just heritability ($h^2$). Heritability measures the share of the phenotypic variation that is due to genes transmitted from parents to their offspring, and it is useful in predicting the response to selection.

The questions of how genes contribute to quantitative trait values and why the trait values are continuously distributed may be answered partially by polygene theory (Johannsen 1909, Nilsson-Ehle 1909, East 1916). In this theory, a quantitative trait is controlled by many genes with small effects and is also easily influenced by environmental effects. However, it is very difficult to dissect the individual genes controlling a quantitative trait by classical quantitative genetic means. Therefore, breeders usually have no idea about the number, location and effect of the individual genes involved in the inheritance of target quantitative traits (Comstock 1978). These genes are also called quantitative trait loci (QTLs). Without QTL information such as number, locations, and effects, it is impossible to manipulate the QTLs by genetic engineering and thereby improve the organism's traits.

The history of QTL mapping can be traced back to the 1920s. Sax (1923) used morphological markers to demonstrate an association between seed weight and seed coat colour in beans. Thoday (1961) used multiple genetic markers to systematically map the individual polygenes that control a quantitative trait. He noted: "The main practical limitation of the technique seems to be the availability of suitable markers."
It is obvious that the numbers of morphological or protein markers are very limited. Therefore, molecular genetic markers are the natural choice for detecting or mapping QTLs. Nowadays, it is much easier to obtain well-distributed genetic markers in almost every organism because of the fast advance of molecular genetic technology. Various statistical methods have been developed to detect or map QTLs by using genetic marker information.

Lander and Botstein (1989) proposed the interval mapping method (IM), which uses two adjacent markers to bracket a region and tests for the existence of a QTL by performing a likelihood ratio test at every position in the region. The method has proved more powerful than one-marker methods and requires fewer progeny. However, interval mapping has some drawbacks. Because it is a one-QTL model, the mapped positions of QTLs will be seriously biased when more than one QTL is located on the same chromosome (Knott and Haley 1992; Martinez and Curnow 1992). Several attempts have since been made to solve this problem. Zeng (1993) proved an important property of multiple regression analysis in relation to QTL mapping: if there is no epistasis, the partial regression coefficient of a trait on a marker depends only on those QTLs that are in the interval bracketed by the two neighbouring markers and is independent of QTLs located in other intervals. Zeng (1994) proposed an improved method called composite interval mapping (CIM), which combines interval mapping with multiple regression analysis. Jansen (1993) also proposed a similar strategy. Composite interval mapping has been shown to perform better than interval mapping when multiple QTLs are linked.

Recently an extended method called multiple interval mapping (MIM) has been proposed (Kao, Zeng and Teasdale 1999). This method fits all QTLs into the model together and can analyse QTL epistasis and the associated statistical issues. A new methodology based on mixed linear model approaches (MCIM) was also proposed for systematically mapping QTLs (Zhu, 1998, 1999; Zhu and Weir, 1998). The MCIM method performs very similarly to Zeng's CIM method (see Chapter 3).
However, the MCIM method does not have the problems of selecting background control markers and setting the mapping window size that the CIM method has. The MCIM method also has the advantage of being easy to extend to more complicated QTL mapping situations, such as QTL epistasis and QTL by environment interaction.
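The likelihood ratio test that these methods apply at each tested position can be illustrated in a much simplified form. The sketch below is not interval mapping: it tests only at marker positions in a simulated backcross and assumes normally distributed residuals, so that the statistic reduces to n ln(RSS0/RSS1). The function name, the sample size, the 5% recombination rate and the effect size are all assumptions made for the illustration.

```python
import math
import random


def marker_lr(trait, genotype):
    """Likelihood ratio statistic comparing one common mean (null)
    against separate means for the two marker classes (alternative),
    under a normal model with maximum likelihood variance estimates."""
    n = len(trait)
    mean_all = sum(trait) / n
    rss0 = sum((y - mean_all) ** 2 for y in trait)
    rss1 = 0.0
    for g in (0, 1):
        ys = [y for y, m in zip(trait, genotype) if m == g]
        mu = sum(ys) / len(ys)
        rss1 += sum((y - mu) ** 2 for y in ys)
    return n * math.log(rss0 / rss1)


rng = random.Random(1)
n_ind = 300
qtl = [rng.randint(0, 1) for _ in range(n_ind)]
# A marker linked to the QTL: recombinant with probability 0.05 (assumed).
linked = [1 - q if rng.random() < 0.05 else q for q in qtl]
# A marker on another chromosome: independent of the QTL genotype.
unlinked = [rng.randint(0, 1) for _ in range(n_ind)]
# Trait = QTL effect (assumed 1.0) plus standard normal noise.
trait = [1.0 * q + rng.gauss(0.0, 1.0) for q in qtl]

lr_linked = marker_lr(trait, linked)
lr_unlinked = marker_lr(trait, unlinked)
print(round(lr_linked, 2), round(lr_unlinked, 2))
```

The statistic is large at the marker close to the simulated QTL and small at the unlinked marker, which is the marker-trait association that the mapping methods exploit.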

1-2 Molecular Markers

In the classical quantitative genetic approach, the units of analysis are genetic variances rather than the underlying genes themselves. However, individual QTLs can be dissected by using linked marker loci. This approach has long been recognized (Sax 1923; Rasmusson 1933; Thoday 1961), but until recently it was regarded as of minor importance because of the lack of sufficient genetic markers. Thanks to modern molecular biology, this situation has changed dramatically. The ability to detect genetic variation directly at the DNA level has resulted in an essentially endless supply of markers for any species of interest. Not surprisingly, there has been an explosion in the use of marker-based methods in quantitative genetics.

The first molecular markers used were allozymes, protein variants detected by differences in migration on starch gels in an electric field. This class of markers has been extensively applied to a variety of genetic problems (Tanksley and Rick 1980; Delourme and Eber 1992; Baes and Van Cutsem 1993; Kindiger and Vierling 1994). Allozymic variants have the advantage of being relatively inexpensive to score in large numbers of individuals, but there is often insufficient protein variation for high-resolution mapping. This is why the rapid development of QTL mapping did not start with the advent of allozymic markers.

As methods for evaluating variation directly at the DNA level became widely available during the mid-1980s, DNA-based markers largely replaced allozymes in mapping studies. DNA is the genetic material of organisms, and genetic differences between individuals are reflected directly in the nucleotide sequences of DNA molecules. There are effectively no limitations on either the genomic location or the number of DNA markers. A wide variety of techniques can be used to measure DNA variation.
Direct sequencing of DNA provides the ultimate measure of genetic variation, but much quicker scoring of variation is sufficient for most purposes. These methods include Restriction Fragment Length Polymorphisms (RFLPs), the Polymerase Chain Reaction (PCR), Randomly Amplified Polymorphic DNAs (RAPDs) and microsatellite DNAs. More recently developed methods include Representational Difference Analysis (RDA) and Genomic Mismatch Scanning (GMS).

RFLPs are one of the simplest and most widely used types of DNA marker. The approach is to digest DNA with a variety of restriction enzymes, each of which cuts the DNA at a specific sequence or restriction site. When the digested DNA is run on a gel under an electric current, the fragments separate according to size. DNA from different individuals can show variation in fragment length. If we attempted to score the entire genome for fragment lengths, the result would be a complete smear on the gel. Instead, individual bands are isolated from this smear by using labelled DNA probes that have base pair complementarity to particular regions of the genome. Each RFLP probe generally scores a single marker locus, and the marker alleles are codominant, as heterozygotes and homozygotes can be distinguished. RFLP markers were first used in the construction of the human genetic map (Botstein et al. 1980; Donis-Keller et al. 1987), and their use has been extended to other species (Beckmann and Soller 1983, 1986a, 1986b; Soller and Beckmann 1988).

PCR is a rather different molecular marker approach that uses short primers for DNA replication to delimit fragment sizes. When a region is flanked by primer-binding sequences in opposite orientation that are sufficiently close together, the PCR reaction replicates this region and generates an amplified fragment. If primer-binding sites are missing or are too far apart, the PCR reaction fails and no fragments are generated for that region. The RAPD method (Williams et al. 1990) follows a similar procedure, in which sequence polymorphisms are detected by using random short sequences as primers. The advantage is that a single probe can reveal several loci at once, each corresponding to a different region of the genome with appropriate primer sites. RAPDs also require smaller amounts of DNA.
However, RAPD markers are dominant, and the marker genotype can be ambiguous. Ragot and Hoisington (1993) concluded that RAPDs are suitable for modest numbers of individuals, while RFLPs are better for larger studies.

Microsatellite DNAs, short arrays of simple repeated sequences, tend to be very highly polymorphic. Since array length is scored, microsatellites are codominant: heterozygotes show two different lengths and hence can be distinguished from homozygotes. This kind of marker is especially suitable for outbred populations, because mapping in such populations is most efficient with marker loci that have a large number of alleles.

RDA and GMS are two recently developed advanced methods. Both examine the entire genome, allowing one to isolate only those sequences that are shared by two populations (GMS) or those that differ between populations (RDA). Good use of these methods will very likely provide powerful approaches for the isolation of QTLs (Lander 1993, Aldhous 1994). Besides the commonly used markers above, other categories of markers can also be very useful in some cases.

The linear arrangement of markers along the chromosomes or genome of a species is called a marker linkage map. Map information is very important for various kinds of QTL mapping research. Many saturated marker maps, in which markers cover the whole genome at reasonable distances, have been published for many organisms (Halward et al. 1993, Xu et al. 1994, Causse et al. 1994, Viruel et al. 1995, Hallden et al. 1996). Based on such saturated maps, many research areas have become more likely to succeed. These include studies on the evolutionary process of organisms through comparative mapping (Lagercrantz et al. 1996, Simon et al. 1997), marker-assisted selection to improve breeding efficiency (Lee 1995, Hamalainen et al. 1997) and marker-based cloning (Xu 1994).

It is necessary to distinguish between physical maps and genetic maps. The set of hereditary material transmitted from parent to offspring is known as the genome, and it consists of molecules of DNA (deoxyribonucleic acid) arranged in chromosomes. The DNA itself is characterized by its nucleotide sequence, that is, the sequence of bases A, C, G or T.
A physical map is an ordering of features of interest along the chromosome in which the metric is the number of base pairs between features. This is the level of detail needed for molecular studies, and there are several techniques available for physical mapping of discrete genetic markers or traits. In this dissertation, however, genetic maps are the main concern; in a genetic map, distances depend on the level of recombination expected between two points. An individual receives one copy of each heritable unit (allele) from each parent at each location (locus) of the genome. The combination of units (haplotype) at different locations (loci) that the individual transmits to the next generation need not be one of the parental sets. Recombination may have taken place during meiosis, the process that produces eggs or sperm. That is, through crossing over events, the alleles in the haploid egg or sperm of a diploid may come from either of the two parental chromosomes. Although there is generally a monotonic relation between physical and recombination distance, the relation is not a simple one.

1-3 Experimental Design

Crosses between completely inbred lines that differ in the trait of interest offer an ideal setting for detecting and mapping QTLs by marker-trait associations. The reason is that all F1 individuals are genetically identical and show complete linkage disequilibrium for genes differing between the inbred lines. A number of designs have been proposed to exploit these features. These designs produce various mapping populations, including backcross, intercross, doubled haploid and recombinant inbred line populations. Most inbred line cross designs involve crop plants, but they have also been applied to a number of animal species, especially mice (reviewed by Frankel 1995).

Here we call the two parental inbred lines P1 and P2; one is the low (L) line and the other is the high (H) line. The F1 individuals receive a copy of each chromosome from each of the two parental lines, and so, wherever the parental lines differ, they are heterozygous. All F1 individuals are genetically identical and have the genotype HL at each locus. Almost all experimental designs start from the F1. In a backcross design, the F1 individuals are crossed to one of the two parental lines, for example, the high line.
The backcross progeny, which may number from 100 to over 1000, receive one chromosome from the F1 and one from the high parental line. Thus, at each locus, they have genotype either HH or HL. As a result of crossing over during meiosis, the process that forms the gametes, the chromosome received from the F1 is a mosaic of the two parental chromosomes. At each locus, there is a one-half chance of receiving the allele from the high parental line and a one-half chance of receiving the allele from the low parental line. The chromosome received will alternate between stretches of L's and H's.

Another common experimental design used in plants is the intercross design. The F2 population is made by selfing or sib mating F1 individuals. The F2 individuals receive two sets of chromosomes from the F1 generation, each of which is a combination of parental chromosomes. Thus, at each locus, the F2 individuals have the genotype HH, HL or LL. The F2 population provides the most genetic information among the different types of mapping populations (Lander et al. 1987) and is relatively easy to obtain.

A doubled haploid (DH) population is composed of many DH lines, which are usually developed from the pollen of an F1 plant through anther culture and chromosome doubling. The individuals of a DH line are homozygous, with genotype HH or LL at each locus along the chromosome. DH populations are also called permanent populations because there is no segregation in further generations. The advantage of the DH population is that the marker data can be used repeatedly in different locations and years under various experimental designs. However, the rate at which pollen grains are successfully turned into DH plants may vary with the genotype of the pollen, and this causes segregation distortion and false linkage between some marker loci.

A recombinant inbred line (RIL) population is constructed by selfing or sib mating individuals for many generations, starting from the F2 and using single seed descent, until almost all of the segregating loci become homozygous. RIL populations have been developed in rice, maize, barley and other species in recent years (Burr et al. 1988, Reiter et al. 1992, Li et al. 1995). The advantage of the RIL population is that the genetic distances are enlarged compared with those obtained from F2 or BC populations.
The reason is that many generations of selfing or sib mating increase the chance of recombination. Therefore, RIL populations may be useful for increasing the precision of QTL mapping. However, not all individuals in a RIL population can become homozygous at all segregating loci within a limited number of generations of selfing or sib mating, which decreases the efficiency of QTL mapping to some extent.

Different experimental design populations are used for different QTL mapping studies. In this dissertation, the B1 and DH populations are used as the chief examples because of their simplicity. At each locus in the genome, the progeny of a B1 or DH population have only two possible genotypes. However, the principles and results obtained here are easy to extend to populations from other experimental designs.

1-4 Models and Software

The QTL information (numbers, positions, effects, etc.) of an experimental population is unobservable. Through the experiment, one can only observe the trait phenotype and the marker information for each individual. The basis of QTL mapping is the idea that genetic markers that tend to be transmitted together with specific values of the trait are likely to be close to a gene affecting that trait. Therefore, genetic and statistical models are very important for describing the data and extracting the QTL information from the data.

Genetic models describe the organism's genetic activity, such as recombination events and additive, dominance, or epistatic phenomena. For more than two markers on a chromosome, the simplifying assumption is that recombination between any two of them is independent of other recombination events. This assumption is called no interference, and under it the occurrence of crossovers along the DNA strands can be considered a Poisson process. Therefore, Haldane's mapping function (Haldane 1919) can be used to describe the relationship between recombination fraction r and genetic distance x.

Statistical models are the methods for obtaining the QTL information from the experimental data through association analysis and statistical calculation.
Without an appropriate statistical model, there is no way to retrieve the QTL information from the experimental data, which consist of quantitative phenotypes and molecular markers. The statistical model is therefore critical for mapping QTLs, and a large number of new models have been proposed since the 1980s (Weller 1986, Lander and Botstein 1989, Haley and Knott 1992, Jansen 1992, 1993, Zeng 1993, 1994, Zhu 1998, Kao 1999, etc.). We can classify these statistical models (methods) based on the number of markers used or the techniques applied (Liu 1997, Hoeschele et al. 1991). Classified by marker number, there are single marker methods, flanking marker methods and multiple marker methods. Classified by technique, there are least squares methods, regression methods, maximum likelihood methods, and mixed linear model approaches. In summary, these methods range from simple to complicated, from detecting QTL-marker association to locating QTL positions and estimating their effects, and from low resolution and power to high resolution and power. We discuss these methods in more detail in the later chapters.

It is possible to use a calculator to solve statistical problems when the data set is not very large and the method is not too complicated. However, computer programs are usually used to analyse data sets by statistical means. Several commercial software packages currently exist for statistical analysis. These general-purpose packages include SAS, SPSS, SPLUS, and STATISTICA. It is possible to use this kind of software for QTL mapping analysis (Haley and Knott 1992). However, the methods for QTL mapping are usually complicated and not standardized, so mapping QTLs with these packages is usually inefficient and sometimes even impossible. Therefore, many computer programs based on specific statistical methods have been developed for QTL mapping (Lander and Botstein 1989, Basten 1994, Wang 1999).

Based on the classical interval mapping principles, Mapmaker/QTL (Lander et al. 1987) is one of the popular QTL mapping programs. This software has versions for PC, Macintosh, and UNIX systems, and it uses a command-driven user interface.
This means that a series of commands must be executed for the different stages, such as data input, the various mapping functions, and output of the results. QTL Cartographer (Basten et al. 1994) is another popular QTL mapping program, developed according to Zeng's composite interval mapping method (Zeng 1994). The software also has versions for PC and UNIX. However, the original software uses several commands to fulfil the mapping tasks, which is sometimes confusing. We have developed a Windows version of QTL Cartographer that has a user-friendly interface and graphical presentation of results. The new version should be much easier to use, and the software is described in more detail later. Other software is also available for QTL mapping, such as QTLSTAT (Liu and Knapp 1992), PGRI (Lu and Liu 1995), MAPQTL (Van Ooijen and Maliepaard 1996) and Map Manager QTL (Manly et al. 1996). These programs are not as popular as Mapmaker/QTL and QTL Cartographer. However, it is believed that QTL mapping software based on new methods will gradually be accepted by genetic researchers over time. An advanced statistical method and a good user interface should be the most important factors for this kind of software.

1-5 Simulation vs. Real Data

A statistical model is used to describe a real biological or genetic system. Because such a system is so complicated and some factors are unknown, it is impossible to include all the factors (parameters) in a model. It is therefore reasonable that there are several statistical models for QTL mapping research. Some of these are quite complicated, and others may be very simple. The properties of an estimator for a statistical model can be obtained parametrically if the distribution of the estimator is known and well characterized. In most models for QTL mapping, however, it is usually too complicated to obtain the properties of the estimators parametrically. Therefore, computer simulation is necessary for obtaining these properties and checking the performance of the models and methods. There is no way to examine a model's performance by using real (experimental) data, because the true parameters are unknown.
The advantage of computer-simulated data is that the true parameters are known and can be compared with the estimates produced by the model.

The data for QTL mapping have two components: the map information and the cross information. The map information data set contains the marker positions and orders for each chromosome or linkage group of an experimental organism. Figure 1-1 is the estimated genetic map for the X chromosome of the mouse, and Table 1-1 gives the same map data in QTL Cartographer format.

Figure 1-1. Marker information of the X chromosome for the mouse data. The numbers are the distances in cM between two markers and the labels are the marker names. Markers shown: Tpm3-rs9, DXMit3, Hmg1-rs14, Hmg14-rs6, DXNds1, Rp118-rs17, Hmg1-rs13, DXMit97, DXMit109, DXMit48, Rps17-rs11, DXMit16, DXMit57.

Table 1-1. Map data in QTL Cartographer format. Columns: No (marker number), Labels (marker name), Interval (marker position in cM, in interval format), Position (marker position in cM, in position format).

The cross information includes the trait values and the marker genotypes at each marker position for the individuals of an experimental population. Table 1-2 shows part of the cross information of the mouse data set, which comes from a backcross population. In a simulation study, we can set the map information for a population and then produce (sample) the cross information of each individual from the population according to various parameters, such as the number, positions and distribution of the QTLs.
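The sampling step described above can be sketched in a few lines. This is a minimal illustration, not the simulation program used in this study: it assumes a backcross with genotypes coded 1 (AA) and 0 (Aa), no crossover interference (Haldane), and a single hypothetical QTL placed exactly on one marker. All function and parameter names are illustrative.

```python
import math
import random


def haldane_r(d_cm):
    """Recombination frequency for a map distance in cM (no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


def simulate_backcross(distances_cm, n_ind, rng):
    """Sample marker genotypes (1 = AA, 0 = Aa) for n_ind backcross individuals.

    distances_cm[k] is the distance between marker k and marker k + 1.
    """
    r = [haldane_r(d) for d in distances_cm]
    rows = []
    for _ in range(n_ind):
        genotypes = [rng.randint(0, 1)]          # 1:1 segregation at marker 1
        for r_k in r:
            recombined = rng.random() < r_k      # crossover in this interval?
            genotypes.append(1 - genotypes[-1] if recombined else genotypes[-1])
        rows.append(genotypes)
    return rows


def simulate_trait(rows, qtl_marker, effect, mean, sd, rng):
    """Trait values for a hypothetical QTL sitting exactly on marker qtl_marker (0-based)."""
    return [mean + effect * g[qtl_marker] + rng.gauss(0.0, sd) for g in rows]


if __name__ == "__main__":
    rng = random.Random(2000)
    markers = simulate_backcross([10.0, 20.0, 15.0, 10.0], 30, rng)
    traits = simulate_trait(markers, qtl_marker=2, effect=1.0, mean=10.0, sd=1.0, rng=rng)
    print(markers[0], round(traits[0], 2))
```

Because the true QTL position and effect are chosen by the user, any estimate obtained from such data can be compared directly with the known values.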

Table 1-2. Cross data of the first 6 individuals for the X chromosome of the mouse. Columns: Ind (individual number), BW (one of the traits: body weight), and the marker genotypes at each marker position on the X chromosome, coded 1 for AA and 0 for Aa.

1-6 Map Functions and Marker Analysis

1. Map Functions

Obtaining marker information, such as position and order along the chromosomes, is very important for the QTL mapping study of an organism. The state of a specific genetic marker is called the marker genotype. There are two marker genotype states in a backcross (or DH) population. We can use 1 to represent the MM genotype and 0 for Mm (-1 for mm) at marker M. Individuals sharing the same parents may have different genotypes at the same genetic markers. These differences provide the variation needed to estimate statistically the relationship between genetic markers, for the purpose of resolving their linear order along the chromosomes of the organism. Recombination, or crossing over, during prophase I of meiosis is the reason why individuals with the same parents may have different marker genotypes: during the production of gametes, an exchange of material between pairs of chromosomes may occur. The resulting variation, or recombinants, can be detected with laboratory techniques and recorded as the marker genotype of each individual. There are several facts about the marker genotype:
- The closer two markers are, the less likely a recombination event is to occur between them.
- Markers that reside on different chromosomes are unlinked.

- Two markers that never experience a recombination event between them are called completely linked. They travel together during meiosis.
- If an even number of crossover events occurs between two genetic markers, the event is undetectable.

The number of crossovers (k) in an interval defined by two genetic markers has a Poisson distribution with mean θ, and a recombinant is observed when k is odd, that is:

$$ r = \Pr(\text{recombination}) = \sum_{k\ \text{odd}} \frac{\theta^{k} e^{-\theta}}{k!} = e^{-\theta}\left(\frac{\theta}{1!} + \frac{\theta^{3}}{3!} + \cdots\right) = e^{-\theta}\,\frac{e^{\theta} - e^{-\theta}}{2} = \frac{1}{2}\left(1 - e^{-2\theta}\right) \tag{1-1} $$

where θ is the number of map units M between the two markers. Here M stands for Morgan, and one M equals 100 cM (centimorgans). Solving the above equation for θ gives Haldane's map function:

$$ \theta = -\frac{1}{2}\ln(1 - 2r) \tag{1-2} $$

If r equals 0, θ is also 0, which is the case of complete linkage. If r equals 0.5, θ becomes infinite, which means the markers are unlinked. This happens when the markers reside on different chromosomes, or when they are on the same chromosome but far apart.

If interference is taken into account, the Kosambi map function should be used:

$$ \theta = \frac{1}{4}\ln\frac{1 + 2r}{1 - 2r} \tag{1-3} $$

Table 1-3. Relationship between recombination frequency and map distance (M). Columns: Recom (recombination frequency), Haldane, Kosambi.

Table 1-3 shows the relationship between r and map distance under the two map functions. Compared with the Haldane function, the Kosambi function gives a smaller map distance for the same recombination frequency, and the difference grows as the two markers become further apart. For very small recombination frequencies, however, both the Haldane and the Kosambi map distances are close to the recombination frequency itself.
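As an illustration of formulas (1-2) and (1-3), the two map functions and their inverses can be written as small helper functions. This is only a sketch for checking values by hand; the function names are illustrative, and distances are in Morgans.

```python
import math


def haldane_distance(r):
    """Map distance in Morgans from recombination frequency r, formula (1-2)."""
    return -0.5 * math.log(1.0 - 2.0 * r)


def haldane_recombination(theta):
    """Inverse of formula (1-2), i.e. formula (1-1)."""
    return 0.5 * (1.0 - math.exp(-2.0 * theta))


def kosambi_distance(r):
    """Map distance in Morgans from recombination frequency r, formula (1-3)."""
    return 0.25 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def kosambi_recombination(theta):
    """Inverse of formula (1-3): r = tanh(2 * theta) / 2."""
    return 0.5 * math.tanh(2.0 * theta)


if __name__ == "__main__":
    for r in (0.01, 0.10, 0.20, 0.30, 0.40):
        print(f"r = {r:.2f}  Haldane = {100 * haldane_distance(r):6.1f} cM"
              f"  Kosambi = {100 * kosambi_distance(r):6.1f} cM")
```

For small r both functions return a distance close to r itself, while for larger r the Haldane distance exceeds the Kosambi distance, as stated above.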

2. Marker Order Analysis

It is necessary to estimate the probability of recombination between each pair of genetic markers. Recombination that occurs in the F1 gametes is detectable in the backcross (B1) generation. Assume we have two markers M and N, each having two versions, or alleles: M1, M2 and N1, N2. The possible genotypes of the two genetic markers in the B1 population are M1/M1, M1/M2 and N1/N1, N1/N2. If an offspring's genotype differs from the parental genotype at the markers, a recombination event has been observed.

Table 1-4. The possible genotypes for markers M and N in the B1 population.

Marker genotypes    N1/N1    N1/N2
M1/M1               n1       n2
M1/M2               n3       n4

From Table 1-4 it is easy to see that the total number of recombination events is $n_2 + n_3$. The estimate of the recombination frequency between markers M and N is therefore $(n_2 + n_3)/(n_1 + n_2 + n_3 + n_4)$. The maximum likelihood method can also be used to solve this problem. The likelihood function describing this situation is

$$ L(r) = C\, r^{\,n_2 + n_3} (1 - r)^{\,n_1 + n_4}. $$

Taking the natural logarithm:

$$ \ln L(r) = \ln C + (n_2 + n_3)\ln r + (n_1 + n_4)\ln(1 - r). $$

Setting the derivative with respect to r to 0 and solving for r:

$$ \frac{\partial \ln L(r)}{\partial r} = \frac{n_2 + n_3}{r} - \frac{n_1 + n_4}{1 - r} = 0 \quad\Longrightarrow\quad \hat{r} = \frac{n_2 + n_3}{n_1 + n_2 + n_3 + n_4}. $$

This formula makes it very easy to calculate the pairwise recombination frequency between each pair of markers. From these calculations we can decide the linkage groups. A linkage group is a group of markers in which each marker is linked (r < 0.5) to at least one other marker. If a marker is not linked to any marker in a linkage group, it does not belong to that group and most likely belongs to some other linkage group. In theory, the number of linkage groups should equal the number of chromosomes. However, the number of linkage groups is sometimes greater than the number of chromosomes because of sampling variance and the limited sample size. In other words,

some of the recombination events are not detected by the experiment. In this case, it is necessary to increase the sample size or to carry out more experiments.

Figure 1-2. A linkage group structure for the simulation study. The numbers above the markers are the distances between two markers in cM, and the labels below identify the markers.

Table 1-5. Simulation data set of marker genotypes for a backcross population.

Individual   Markers              Individual   Markers
             1  2  3  4  5                     1  2  3  4  5
 1           AA AA AA Aa AA       16           AA AA Aa AA AA
 2           AA Aa Aa AA AA       17           Aa Aa Aa Aa Aa
 3           Aa Aa Aa Aa Aa       18           Aa Aa Aa Aa Aa
 4           AA Aa Aa AA Aa       19           Aa Aa Aa AA Aa
 5           AA AA AA AA AA       20           Aa Aa Aa Aa Aa
 6           AA Aa AA AA AA       21           AA AA AA AA AA
 7           AA AA Aa Aa Aa       22           AA AA AA AA AA
 8           AA AA AA AA AA       23           Aa Aa Aa Aa Aa
 9           Aa Aa Aa Aa Aa       24           Aa AA Aa AA Aa
10           AA AA AA Aa AA       25           Aa Aa Aa Aa Aa
11           Aa Aa Aa Aa Aa       26           AA AA AA AA AA
12           AA AA AA AA Aa       27           AA Aa Aa AA Aa
13           AA Aa Aa AA AA       28           AA AA AA AA AA
14           Aa Aa Aa Aa Aa       29           Aa AA Aa Aa Aa
15           AA AA AA AA AA       30           AA AA AA AA AA

Table 1-5 is a simulated data set containing the marker genotypes of a B1 population with 5 markers and 30 individuals, produced from the linkage structure shown in Figure 1-2. The Haldane map function was used. The numbers of recombination events between two markers, which are the counts of changes from genotype AA to Aa or from genotype Aa to AA, are presented in Table 1-6. Table 1-6 also gives the recombination frequencies, which are the numbers of recombination events divided by the total number of individuals, 30.

It is very important to know the marker order and positions along the linkage group or chromosome. We can estimate this information from the table of recombination frequencies (Table 1-6). From Table 1-6 we see that the smallest value is 0.13, which is the recombination frequency between marker 1 and 5 or between

marker 3 and 5. Here we choose 3-5 as the starting point (1-5 could also be chosen). We then look for the smallest value either on the marker 3 side (2-3 is 0.17) or on the marker 5 side (5-1 is 0.13), and the new order becomes 3-5-1. The next marker picked is 4 (1-4 is 0.17) and the new order is 3-5-1-4. Therefore the final order is 2-3-5-1-4. After obtaining the marker order, it is easy to estimate the map distances between markers from the recombination frequencies and an appropriate map function. For example, the recombination frequency between markers 2 and 3 is 0.17, and the distance is 20.8 cM when calculated with formula (1-2). The final result is shown in Figure 1-3.

Table 1-6. The counts (frequencies) of recombination events.

Markers   1         2         3         4          5
1         0(0.00)   7(0.23)   6(0.20)   5(0.17)    4(0.13)
2                   0(0.00)   5(0.17)   10(0.33)   7(0.23)
3                             0(0.00)   9(0.30)    4(0.13)
4                                       0(0.00)    7(0.23)
5                                                  0(0.00)

Figure 1-3. Estimated linkage group structure for the simulation data set.

Comparing Figure 1-3 with Figure 1-2, the estimated marker order is correct but the distances between markers are not very accurate. This is quite reasonable given such a small sample size (only 30 individuals). As the sample size increases, the estimates will become more precise. From this simple example it seems quite easy to obtain the marker order and the estimates of the marker distances by counting the recombination events. However, as the number of markers increases, the problem of ordering a set of genetic markers becomes very difficult. This problem is equivalent to the famous Travelling Salesman Problem. One of the criteria for comparing two different orders is to minimize the

Sum of Adjacent Recombination Fractions (SAR). For the above example, the SAR value for the final order is 0.17 + 0.13 + 0.13 + 0.17 = 0.60. Another criterion is SAL, which stands for Sum of Adjacent Likelihoods. The main problem in ordering the markers is not the criterion but the computation time. As the number of markers increases, the number of possible orders quickly becomes unmanageable computationally. Therefore, the only practical way to solve this problem is to find a better (not necessarily the best) order through some kind of search procedure. Several methods have been proposed, including Branch and Bound (Thompson, 1984), Simulated Annealing (Weeks and Lange, 1987), Seriation (Buetow and Chakravarti, 1987a, 1987b), and Rapid Chain Delineation (Doerge, 1993). A number of software packages are available for ordering markers and estimating the distances between them; MAPMAKER (Lander et al. 1987) is one of them.

3. Marker Segregation Analysis

It is also important to carry out a Mendelian segregation test for each marker in order to detect segregation distortion. By expectation, the segregation ratio should be 1:1 for BC, DH, or RIL populations and 1:2:1 for the intercross population. In a backcross population, the cross between A/A and A/a produces the zygotes AA and Aa with the same expected number, n/2. Table 1-7 shows the expected and observed numbers for the simulation data set of Table 1-5. A test statistic can be constructed using $\chi^2$ under the null hypothesis p(AA) = p(Aa) = 0.5 (Mendelian segregation), as shown in formula (1-4). In this example the number of individuals is n = 30, and $n_1$ and $n_2$ are the observed numbers of genotypes AA and Aa at each marker position.

$$ \chi^2 = \sum \frac{(\text{Obs.\#} - \text{Exp.\#})^2}{\text{Exp.\#}} = \frac{(n_1 - n_2)^2}{n} \sim \chi^2_1 \tag{1-4} $$

Rejecting $H_0$ means that the deviation from Mendelian segregation is significant; this phenomenon is called segregation distortion. Segregation distortion can be caused by sampling variation.
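The counting behind Table 1-6, the SAR criterion and formula (1-4) can be checked directly on the data of Table 1-5. The sketch below is only an illustration: the genotype rows are transcribed from Table 1-5 with AA coded as "1" and Aa as "0", and the function names are not part of any published program.

```python
from itertools import combinations

# Table 1-5, individuals 1..30, markers 1..5 (1 = AA, 0 = Aa).
TABLE_1_5 = [
    "11101", "10011", "00000", "10010", "11111", "10111", "11000", "11111",
    "00000", "11101", "00000", "11110", "10011", "00000", "11111", "11011",
    "00000", "00000", "00010", "00000", "11111", "11111", "00000", "01010",
    "00000", "11111", "10010", "11111", "01000", "11111",
]


def recombinant_counts(rows):
    """Number of individuals whose genotypes differ between each marker pair."""
    n_markers = len(rows[0])
    return {(i + 1, j + 1): sum(row[i] != row[j] for row in rows)
            for i, j in combinations(range(n_markers), 2)}


def sar(order, counts, n_ind):
    """Sum of Adjacent Recombination fractions for a given marker order."""
    total = 0
    for a, b in zip(order, order[1:]):
        total += counts[(min(a, b), max(a, b))]
    return total / n_ind


def segregation_chi2(rows, marker):
    """Chi-square statistic of formula (1-4) for one marker (1-based index)."""
    n = len(rows)
    n1 = sum(row[marker - 1] == "1" for row in rows)   # observed AA
    n2 = n - n1                                        # observed Aa
    return (n1 - n2) ** 2 / n


if __name__ == "__main__":
    counts = recombinant_counts(TABLE_1_5)
    for pair, c in sorted(counts.items()):
        print(pair, c, round(c / len(TABLE_1_5), 2))
    print("SAR for order 2-3-5-1-4:", sar([2, 3, 5, 1, 4], counts, len(TABLE_1_5)))
    print("chi-square per marker:",
          [round(segregation_chi2(TABLE_1_5, m), 2) for m in range(1, 6)])
```

With only five markers every order could be enumerated; for realistic marker numbers the search procedures cited above are needed instead.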
However, it is sometimes caused by genetic factors, such as the

selection force differing among the types of zygotes. Significant segregation distortion can bias the estimation of the recombination frequency (distance) between markers. It can also reduce the power to identify QTLs and bias the estimates of QTL positions and effects.

Table 1-7. Marker segregation analysis for the simulation data set.

Markers                 Marker 1    Marker 2    Marker 3    Marker 4    Marker 5
Genotypes               AA   Aa     AA   Aa     AA   Aa     AA   Aa     AA   Aa
Frequency under H0 (1)  ½    ½      ½    ½      ½    ½      ½    ½      ½    ½
Expected number
Observed number
χ² value
p-value                 >0.50       >0.995      >0.50       >0.50       >

(1) H0: null hypothesis.

1-7 Purpose of This Research

The purpose of QTL mapping is to identify and locate the QTLs along the chromosomes of a species by means of special experimental designs and genetic marker information. Information on the QTLs, such as their number, locations, and effects, can help geneticists and breeders to improve the quality and quantity of plants or animals. However, QTL mapping methods are fundamentally based on statistical principles. It is important to understand these principles before using a particular QTL mapping method to analyse an experimental data set. Moreover, comparing different QTL mapping methods is useful for understanding how the various methods perform under different circumstances. This kind of study can help users to choose the appropriate QTL mapping method according to the requirements of their experiment, and it provides a basis for interpreting the results of a QTL mapping analysis.

In this research, a large-scale computer simulation has been conducted to study and compare the performance of the major QTL mapping methods. These methods include interval mapping, composite interval mapping, and mixed-model-based composite interval mapping. We have also conducted a series of simulation studies to identify model selection criteria, which are a critical part of multiple QTL mapping methods.
The computer software accompanying a particular

QTL mapping method is very important, because the method is usually too complicated to use without computer software. However, most existing QTL mapping software uses a command-driven interface, which is usually not very convenient. We have developed a QTL mapping program with a user-friendly interface and the ability to visualize results. The software is called Windows QTL Cartographer (Wang et al. 1999); it has been posted on the Internet and has many users.

2. Review of Major QTL Mapping Methods

2-1 One Marker Method

The one marker method is based on the simple idea that if there is an association between marker type and trait value, a QTL is likely to be close to that marker locus. The approach has been applied in many studies of QTLs in various organisms, such as Drosophila (Thoday, 1961), maize (Edwards et al., 1987) and tomato (Weller, Soller and Brody, 1988).

Table 2-1. Trait mean and distribution for various populations.

Population   Genotype (1)   Mean                Distribution
P1           MQ/MQ          µ_1 = µ + a         N(µ_1, σ²)
P2           mq/mq          µ_2 = µ - a         N(µ_2, σ²)
F1           MQ/mq          µ_12 = µ + d        N(µ_12, σ²)

(1) M or m denotes the marker and Q or q denotes the QTL.

Table 2-2. Frequencies and mean effects for various marker-QTL genotypes in the B1 population.

Genotype        MQ/MQ       MQ/Mq    MQ/mQ    MQ/mq
Frequency (1)   (1 - r)/2   r/2      r/2      (1 - r)/2
Mean effect     µ + a       µ + d    µ + a    µ + d

(1) r is the recombination frequency between marker and QTL.

1. Statistical Basis for the One Marker Method

Suppose that two parental inbred lines differ sufficiently in the quantitative trait that we are convinced there are QTLs responsible for the trait difference. Assume that the trait values of the two parental lines and the F1 population are normally distributed as

shown in Table 2-1. The frequencies and mean effects of the various marker-QTL genotypes in the B1 population (P1 × F1 = MQ/MQ × MQ/mq) are shown in Table 2-2. Although we cannot observe the QTL genotypes, the marker genotypes are observable, and the mean effects of the marker genotypes in the B1 population are shown in Table 2-3. If only one QTL is linked to marker M, the mean difference between the two marker types in the B1 population is given by formula (2-1). If epistatic effects are ignored, the mean difference when multiple QTLs are linked to marker M is given by formula (2-2).

Table 2-3. Mean effects for various marker genotypes in the B1 population.

Marker type                  QQ       Qq       Mean effect
MM     Frequency (1)         1 - r    r
       Effect                µ + a    µ + d    µ_MM = (1 - r)(µ + a) + r(µ + d)
Mm     Frequency (1)         r        1 - r
       Effect                µ + a    µ + d    µ_Mm = r(µ + a) + (1 - r)(µ + d)

(1) The frequency of the QTL genotypes; r is the recombination frequency between Q and M.

$$ \mu_{MM} - \mu_{Mm} = [(1 - r)(\mu + a) + r(\mu + d)] - [r(\mu + a) + (1 - r)(\mu + d)] = (1 - 2r)(a - d) \tag{2-1} $$

$$ \mu_{MM} - \mu_{Mm} = \sum_{k=1}^{m} (1 - 2r_{ik})(a_k - d_k) \tag{2-2} $$

where $r_{ik}$ is the recombination frequency between the marker and the k-th QTL.

2. The t-test Method

From formula (2-1) it follows that if the difference between the means of the two marker genotypes is not zero, it can be inferred that $r \neq 0.5$, since it is known that $\delta = (a - d) \neq 0$. Therefore, we can use the t-test statistic (formula 2-3) to test for linkage between marker M and QTL Q.

Hypotheses: $H_0: \mu_{MM} - \mu_{Mm} = 0$, $H_1: \mu_{MM} - \mu_{Mm} \neq 0$

t-test statistic:

$$ t = \frac{\hat{\mu}_{MM} - \hat{\mu}_{Mm}}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \sim t(n_1 + n_2 - 2) \tag{2-3} $$

Here $n_1$ and $n_2$ represent the numbers of individuals belonging to the MM and Mm marker classes, respectively.

The pooled variance is

$$ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} $$

where $s_1^2$ is the estimate of the variance for individuals of the MM marker class and $s_2^2$ is the estimate of the variance for individuals of the Mm marker class.

3. Likelihood Ratio Test Method

For a normally distributed variable $Y \sim N(\mu, \sigma^2)$, the likelihood for the parameters is

$$ L(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(Y - \mu)^2}{2\sigma^2}}. $$

The phenotypic distribution in the B1 population is a mixture of normal distributions:

$Y \sim (1 - r)N(\mu_1, \sigma^2) + rN(\mu_{12}, \sigma^2)$ for the MM marker genotype
$Y \sim rN(\mu_1, \sigma^2) + (1 - r)N(\mu_{12}, \sigma^2)$ for the Mm marker genotype

The likelihood function of any one marker for the backcross scenario is shown in formula (2-4):

$$ L(\mu_1, \mu_{12}, \sigma^2, r;\ x_1, \ldots, x_n, y_1, \ldots, y_n) = \prod_{i=1}^{n_1}\left[(1 - r)N(\mu_1, \sigma^2) + rN(\mu_{12}, \sigma^2)\right] \prod_{i=1}^{n_2}\left[rN(\mu_1, \sigma^2) + (1 - r)N(\mu_{12}, \sigma^2)\right] \tag{2-4} $$

where each $N(\cdot)$ denotes the corresponding normal density evaluated at the trait value of the individual, the first product runs over the $n_1$ individuals of the MM class, and the second over the $n_2$ individuals of the Mm class. The hypothesis of no linkage can be tested with the likelihood ratio statistic.

Hypotheses: $H_0: r = 0.5$, $H_a: r < 0.5$

Likelihood ratio test statistic:

$$ \lambda = 2\ln\frac{L_a(\hat{\mu}_1, \hat{\mu}_{12}, \hat{\sigma}^2, r < 0.5)}{L_0(\hat{\mu}_1, \hat{\mu}_{12}, \hat{\sigma}^2, r = 0.5)} $$

The estimates of $\mu_1$, $\mu_{12}$ and $\sigma^2$ differ depending on whether r is estimated or set to 0.5. In practice, a set of different values of r is tried, and the LR score shows how much more likely the data are if a QTL is present than if no QTL is present. The peak of the LR score can then be compared with the threshold value, which is derived according to the significance level.
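The single-marker t-test of formula (2-3), with the pooled variance defined at the beginning of this page, is simple enough to compute by hand. The sketch below is a plain-Python illustration under the usual equal-variance assumption; the function name and the small data set in the example are hypothetical and are not taken from this study.

```python
import math


def pooled_t(trait_mm, trait_mh):
    """Two-sample t statistic of formula (2-3) with pooled variance.

    trait_mm: trait values of individuals with marker genotype MM
    trait_mh: trait values of individuals with marker genotype Mm
    Returns (t, degrees of freedom).
    """
    n1, n2 = len(trait_mm), len(trait_mh)
    mean1 = sum(trait_mm) / n1
    mean2 = sum(trait_mh) / n2
    s1 = sum((v - mean1) ** 2 for v in trait_mm) / (n1 - 1)   # variance, MM class
    s2 = sum((v - mean2) ** 2 for v in trait_mh) / (n2 - 1)   # variance, Mm class
    sp2 = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)     # pooled variance
    t = (mean1 - mean2) / math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return t, n1 + n2 - 2


if __name__ == "__main__":
    # Hypothetical trait values, for illustration only.
    t, df = pooled_t([10.0, 12.0, 14.0], [7.0, 9.0, 11.0])
    print(f"t = {t:.4f} with {df} degrees of freedom")
```

The resulting t value is compared with the critical value of the t distribution with $n_1 + n_2 - 2$ degrees of freedom. A significant result indicates linkage between the marker and a QTL, but it does not separate the QTL effect from its distance to the marker, because both enter formula (2-1) as a product.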

4. Simple Regression Method

The simple regression model is $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $i = 1, 2, \ldots, n$ is the individual index, $\beta_0$ is the overall mean, and $\beta_1$ is the additive effect of the QTL when an allele substitution is made from the recurrent parent to the non-recurrent parent. $X_i$ is the indicator variable, which takes the value ½ for individuals with marker genotype MM and -½ for individuals with marker genotype Mm (Table 2-4).

Table 2-4. Possible outcomes for the one marker, one QTL situation.

Genotype   Frequency (1)   X value   Y value
MQ/MQ      (1 - r)/2       ½         µ_1 (2)
MQ/mQ      r/2             -½        µ_1
MQ/Mq      r/2             ½         µ_12 (3)
MQ/mq      (1 - r)/2       -½        µ_12

(1) r is the recombination frequency between the marker and the QTL. (2) µ_1 is the mean value of the P1 genotype. (3) µ_12 is the mean value of the F1 genotype.

From Table 2-4, we have:

$$ E(X) = \frac{1 - r}{2}\cdot\frac{1}{2} + \frac{r}{2}\left(-\frac{1}{2}\right) + \frac{r}{2}\cdot\frac{1}{2} + \frac{1 - r}{2}\left(-\frac{1}{2}\right) = 0 $$

$$ E(X^2) = \frac{1 - r}{2}\left(\frac{1}{2}\right)^2 + \frac{r}{2}\left(-\frac{1}{2}\right)^2 + \frac{r}{2}\left(\frac{1}{2}\right)^2 + \frac{1 - r}{2}\left(-\frac{1}{2}\right)^2 = \frac{1}{4} $$

$$ \sigma_X^2 = E(X^2) - [E(X)]^2 = \frac{1}{4} $$

$$ \sigma_{XY} = E(XY) = \frac{1 - r}{2}\cdot\frac{1}{2}\,\mu_1 + \frac{r}{2}\left(-\frac{1}{2}\right)\mu_1 + \frac{r}{2}\cdot\frac{1}{2}\,\mu_{12} + \frac{1 - r}{2}\left(-\frac{1}{2}\right)\mu_{12} = \frac{1}{4}(1 - 2r)(\mu_1 - \mu_{12}) = \frac{1}{4}(1 - 2r)(a - d) $$

$$ \beta_1 = \frac{\sigma_{XY}}{\sigma_X^2} = (1 - 2r_{MQ})(a - d) $$

Therefore, testing whether the slope of the regression model is zero is equivalent to the t-test introduced above.

2-2 Interval Mapping Method

When only one marker is used in QTL mapping, the effects are underestimated and the position cannot be determined. In order to overcome these drawbacks, Lander and Botstein (1989) introduced interval mapping as a

systematic way to scan the whole genome for evidence of QTL. The interval mapping method is an extension of one marker analysis that uses two flanking markers to construct an interval within which a putative QTL is searched for. The concept of using complete marker linkage maps for genomic scanning of QTL is important, and the idea of viewing QTL genotypes as missing data and using a mixture model for maximum likelihood analysis has been influential.

The basic idea of interval mapping is simple. We first consider an interval between two observable markers M and N, each having two possible alleles in a backcross population. The genetic distance, or recombination frequency, between the two markers has been estimated previously. A map function (either Haldane or Kosambi) is used to translate recombination frequency into distance, or vice versa. A LOD score is calculated at each increment (walking step) within the interval, which finally gives the profile of the LOD score for the whole genome. When a peak exceeds the threshold value, we declare that a QTL has been found at that location.

1. Conditional Probabilities of QTL Genotypes

The basic element upon which the formal theory of QTL mapping is built is the probability of the QTL genotype conditional on the observed marker genotypes. From the definition of conditional probability, we have

$$ \Pr(Q \mid MN) = \frac{\Pr(QMN)}{\Pr(MN)} \tag{2-5} $$

The joint and marginal probabilities, Pr(QMN) and Pr(MN), are functions of the experimental design and the linkage map. When computing joint probabilities involving more than two loci, one must also account for recombination interference between loci. When considering a single QTL flanked by two markers M and N, the gamete frequencies depend on three parameters: the recombination frequency $r_{12}$ between the markers, the recombination frequency $r_1$ between marker M and the QTL, and the recombination frequency $r_2$ between the QTL and marker N.

Table 2-5. The probability of the QTL genotype conditional on marker classes in the B1 population.

Marker class   Prob1 (1)        Genotype   Prob2 (2)                 Conditional (3) (Prob2 / Prob1)
MN/MN          (1 - r12)/2      MQN/MQN    (1 - r1)(1 - r2)/2        Pr(QQ) = (1 - r1)(1 - r2)/(1 - r12) ≈ 1
                                MQN/MqN    r1 r2/2                   Pr(Qq) = r1 r2/(1 - r12) ≈ 0
MN/Mn          r12/2            MQN/MQn    (1 - r1) r2/2             Pr(QQ) = (1 - r1) r2/r12 ≈ 1 - p
                                MQN/Mqn    r1 (1 - r2)/2             Pr(Qq) = r1 (1 - r2)/r12 ≈ r1/r12 = p
MN/mN          r12/2            MQN/mQN    r1 (1 - r2)/2             Pr(QQ) = r1 (1 - r2)/r12 ≈ r1/r12 = p
                                MQN/mqN    (1 - r1) r2/2             Pr(Qq) = (1 - r1) r2/r12 ≈ 1 - p
MN/mn          (1 - r12)/2      MQN/mQn    r1 r2/2                   Pr(QQ) = r1 r2/(1 - r12) ≈ 0
                                MQN/mqn    (1 - r1)(1 - r2)/2        Pr(Qq) = (1 - r1)(1 - r2)/(1 - r12) ≈ 1

(1) Probability of the marker class. (2) Probability of the marker-QTL genotype. (3) Conditional probability of the QTL genotype according to formula (2-5); here p equals r1/r12.

Under the assumption of no interference (Haldane), the relationship between $r_{12}$ and $r_1$, $r_2$ is $r_{12} = r_1 + r_2 - 2r_1 r_2$, while $r_{12} = r_1 + r_2$ under complete interference (Kosambi). When $r_{12}$ is small, gamete frequencies are essentially identical under either interference assumption. Because the QTL genotype is unknown, we can only use the observable marker genotypes to infer it. Table 2-5 shows the probability of the QTL genotype given the genotypes of the two flanking markers.

2. Genetic Model

For a backcross population, to analyse a QTL located in an interval flanked by markers M and N, the interval mapping method assumes the following linear model:

$$ y_j = \mu + b^* x_j^* + e_j, \qquad j = 1, 2, \ldots, n \tag{2-6} $$

where
$b^*$ = the effect of the putative QTL,
$x_j^*$ = 1 if the QTL genotype is QQ and 0 if the QTL genotype is Qq,
$e_j \sim N(0, \sigma^2)$.

In the model, the variable $x^*$ indicates the QTL genotype, which is unobserved. However, the probabilities of the possible QTL genotypes can be inferred from the genotypes of the two flanking markers, as shown in Table 2-5 and summarized in Table 2-6. For a backcross population, we can define

$$ p_{kj} = \mathrm{Prob}(x_j^* = k \mid M, N, p), \qquad k = 0, 1, $$

where $p = r_1 / r_{12}$, and the approximation is obtained by assuming that double recombination events can be ignored.

Table 2-6. The probabilities of possible QTL genotypes conditional on marker classes.

Marker class   Number   QTL genotype QQ (1)                        QTL genotype Qq (0)
MN/MN          n1       (1 - r1)(1 - r2)/(1 - r12) ≈ 1             r1 r2/(1 - r12) ≈ 0
MN/Mn          n2       (1 - r1) r2/r12 ≈ 1 - p                    r1 (1 - r2)/r12 ≈ p
MN/mN          n3       r1 (1 - r2)/r12 ≈ p                        (1 - r1) r2/r12 ≈ 1 - p
MN/mn          n4       r1 r2/(1 - r12) ≈ 0                        (1 - r1)(1 - r2)/(1 - r12) ≈ 1

3. Maximum Likelihood Analysis

For model (2-6), there are two possible QTL genotypes, each of which can be true with a certain probability. The distribution of the model is a mixture of normal distributions, and the likelihood function can be defined as

$$ L(\mu, b^*, \sigma^2, p) = \prod_{j=1}^{n}\left[p_{1j}\,\phi\!\left(\frac{y_j - \mu - b^*}{\sigma}\right) + p_{0j}\,\phi\!\left(\frac{y_j - \mu}{\sigma}\right)\right] \tag{2-7} $$

where $\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$ is the standard normal density function.

In likelihood function (2-7), the parameters include:
- $\mu$: the mean of the model
- $b^*$: the effect of the putative QTL
- $p = r_1 / r_{12}$: the position of the putative QTL relative to the flanking markers
- $\sigma^2$: the residual variance of the model

The data of the analysis include:
- $y_j$: the phenotypic value of a quantitative trait for each individual
- the marker genotypes of each individual, which contribute to the analysis through $p_{kj}$, $k = 0, 1$; $j = 1, 2, \ldots, n$
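The quantities in Table 2-6 and likelihood (2-7) can be evaluated with a short sketch. This is an illustration only: it assumes no interference, so that $r_{12} = r_1 + r_2 - 2r_1 r_2$, and it writes each mixture component as a full normal density (including the factor $1/\sigma$), which differs from (2-7) only by a constant that does not depend on $b^*$ or $p$. The function names are illustrative.

```python
import math


def prob_qq(r1, r2):
    """Pr(QQ | marker class) for the four backcross marker classes of Table 2-6."""
    r12 = r1 + r2 - 2.0 * r1 * r2          # no-interference assumption
    return {
        "MN/MN": (1.0 - r1) * (1.0 - r2) / (1.0 - r12),
        "MN/Mn": (1.0 - r1) * r2 / r12,
        "MN/mN": r1 * (1.0 - r2) / r12,
        "MN/mn": r1 * r2 / (1.0 - r12),
    }


def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_loglik(y, p1, mu, b, sd):
    """Log of likelihood (2-7); p1[j] is Pr(QQ) for individual j, so p0[j] = 1 - p1[j]."""
    total = 0.0
    for y_j, p1_j in zip(y, p1):
        total += math.log(p1_j * normal_pdf(y_j, mu + b, sd)
                          + (1.0 - p1_j) * normal_pdf(y_j, mu, sd))
    return total


if __name__ == "__main__":
    probs = prob_qq(0.05, 0.05)            # putative QTL in the middle of the interval
    for marker_class, p in probs.items():
        print(marker_class, round(p, 4))
```

For a putative QTL in the middle of the interval, the two recombinant marker classes give Pr(QQ) = 0.5, while the two non-recombinant classes are almost fully informative about the QTL genotype.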

The maximum likelihood analysis of a mixture model is usually carried out with an Expectation-Maximization (EM) algorithm. EM is an iterative procedure, and the E-step for likelihood function (2-7) is to calculate:

$$ P_j = \frac{p_{1j}\,\phi\!\left([y_j - \mu - b^*]/\sigma\right)}{p_{1j}\,\phi\!\left([y_j - \mu - b^*]/\sigma\right) + p_{0j}\,\phi\!\left([y_j - \mu]/\sigma\right)} $$

The M-step is to calculate:

$$ \hat{b}^* = \frac{\sum_{j=1}^{n}(y_j - \hat{\mu})P_j}{\sum_{j=1}^{n}P_j}, \qquad \hat{\mu} = \frac{1}{n}\sum_{j=1}^{n}\left(y_j - P_j\hat{b}^*\right), \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{j=1}^{n}\left[(y_j - \hat{\mu})^2 - P_j\hat{b}^{*2}\right] $$

This process is iterated until the estimates converge.

4. Likelihood Ratio Test

The test statistic can be constructed using a likelihood ratio expressed as a LOD (logarithm of odds) score:

$$ \mathrm{LOD} = -\log_{10}\frac{L(\hat{\hat{\mu}}, b^* = 0, \hat{\hat{\sigma}}^2)}{L(\hat{\mu}, \hat{b}^*, \hat{\sigma}^2)} $$

under the hypotheses $H_0: b^* = 0$ and $H_1: b^* \neq 0$. By assuming that the putative QTL is located at the position indicated by $p = r_1 / r_{12}$, we can obtain the maximum likelihood estimates of $\mu$, $b^*$, $\sigma^2$ under $H_1$ as $\hat{\mu}$, $\hat{b}^*$, $\hat{\sigma}^2$, and under $H_0$ as $\hat{\hat{\mu}}$, $\hat{\hat{\sigma}}^2$ with $b^*$ constrained to zero. The LOD score test is essentially the same as the usual likelihood ratio test:

$$ \mathrm{LR} = -2\ln\frac{L(\hat{\hat{\mu}}, b^* = 0, \hat{\hat{\sigma}}^2)}{L(\hat{\mu}, \hat{b}^*, \hat{\sigma}^2)} $$

And we have the relationship between LOD value and LR value as

$$ \mathrm{LOD} = \frac{1}{2}(\log_{10} e)\,\mathrm{LR} = 0.217\,\mathrm{LR} $$

The test can be performed at any position covered by markers, and thus the method provides a systematic strategy for searching for QTL. The amount of support for a QTL at a particular map position is often displayed graphically as a likelihood map (profile), which plots the likelihood ratio test statistic as a function of the map position of the putative QTL. If the LOD score in a region exceeds a pre-defined critical threshold, a QTL is indicated in the neighbourhood of the maximum of the LOD score, with the width of the neighbourhood defined by a one- or two-LOD support interval (Lander and Botstein 1989). By the properties of maximum likelihood analysis, the estimates of the locations and effects of QTL are asymptotically unbiased if the assumption that there is at most one QTL on a chromosome is true.

The test statistic LR for a given position is expected to be asymptotically chi-square distributed under the null hypothesis, with one degree of freedom for the backcross design and two degrees of freedom for the F2 design (Lander and Botstein 1989, Van Ooijen 1992, Zeng 1994). However, because the test is usually performed over the whole genome, there is a multiple testing problem. The distribution of the maximum LR or LOD score over the whole genome under the null hypothesis becomes very complicated. An asymptotic theory for determining appropriate genome-wise critical values, based on an Ornstein-Uhlenbeck diffusion process, has been developed by Lander and Botstein (1989), Feingold et al. (1993) and Lander and Schork (1994). Lander and Botstein (1989) suggested that a typical LOD score threshold should be between 2 and 3 to ensure a 5% overall false positive error for detecting QTL.

2-3 Composite Interval Mapping

For the interval mapping method, the estimated locations and effects of QTL tend to be asymptotically unbiased if there is only one segregating QTL on a chromosome.
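The EM iteration and the LOD score of the interval mapping method described above can be sketched for a single test position as follows. This is an illustrative sketch, not the implementation in any published software: the conditional probabilities $p_{1j}$ are assumed to be supplied by the caller, the starting values are arbitrary choices, and the residual variance is updated from the posterior-weighted sum of squares, which equals the M-step expression for $\hat{\sigma}^2$ once the iteration has converged.

```python
import math


def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_loglik(y, p1, mu, b, sd):
    return sum(math.log(p * normal_pdf(v, mu + b, sd)
                        + (1.0 - p) * normal_pdf(v, mu, sd))
               for v, p in zip(y, p1))


def em_interval(y, p1, max_iter=500, tol=1e-10):
    """EM estimates (mu, b, sd) of model (2-6) at one putative QTL position."""
    n = len(y)
    mu = sum(y) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in y) / n)
    b = 0.0
    for _ in range(max_iter):
        # E-step: posterior probability P_j that individual j is QQ.
        post = []
        for v, p in zip(y, p1):
            qq = p * normal_pdf(v, mu + b, sd)
            qh = (1.0 - p) * normal_pdf(v, mu, sd)
            post.append(qq / (qq + qh))
        # M-step: update b*, then mu, then the residual variance.
        new_b = sum((v - mu) * w for v, w in zip(y, post)) / sum(post)
        new_mu = sum(v - w * new_b for v, w in zip(y, post)) / n
        ss = sum(w * (v - new_mu - new_b) ** 2 + (1.0 - w) * (v - new_mu) ** 2
                 for v, w in zip(y, post))
        converged = abs(new_b - b) < tol and abs(new_mu - mu) < tol
        mu, b, sd = new_mu, new_b, math.sqrt(ss / n)
        if converged:
            break
    return mu, b, sd


def lod_score(y, p1):
    """Return (LOD, LR) for H0: b* = 0 against H1: b* != 0."""
    n = len(y)
    mu0 = sum(y) / n
    sd0 = math.sqrt(sum((v - mu0) ** 2 for v in y) / n)
    mu, b, sd = em_interval(y, p1)
    lr = -2.0 * (mixture_loglik(y, p1, mu0, 0.0, sd0)
                 - mixture_loglik(y, p1, mu, b, sd))
    return 0.5 * math.log10(math.e) * lr, lr
```

Repeating `lod_score` at successive walking steps, with the $p_{1j}$ recomputed for each position, gives the LOD profile along a chromosome.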
However, if there is more than one QTL on a chromosome, the test statistic at the position being tested will be affected by all of those QTL, and the estimated positions and

effects of QTL identified by this method are likely to be biased (the "ghost QTL" problem). One of the reasons for these shortcomings is that the test used in the interval mapping method is not an interval test. In an interval test, the effect of a QTL within a defined interval should be independent of the effects of QTL outside that region. Otherwise, even when there is no QTL within an interval, the likelihood profile in the interval can still exceed the threshold significantly if there is a QTL in some nearby region on the same chromosome. In order to overcome this shortcoming of the interval mapping method, Zeng (1994) proposed an improved method, called composite interval mapping, which combines interval mapping with multiple regression analysis. Let us first review some relevant theory of multiple regression analysis for QTL mapping (Zeng 1993).

1. Properties of Multiple Regression Analysis

Because of the linear arrangement of genes on chromosomes, multiple regression analysis has a very important property: the partial regression coefficient of a trait on a marker is expected to depend only on those QTLs that are located in the interval bracketed by the two neighbouring markers. It is independent of any other QTL outside that region if there is no crossover interference and no epistasis. However, interference and epistasis will introduce non-linearity into the model. Suppose we regress the trait value y on t markers observed in a B1 population:

$$ y_j = \mu + \sum_{k=1}^{t} b_k x_{jk} + e_j $$

where $x_{jk}$ is the indicator value (1 or 0) of the k-th marker in the j-th individual, and $b_k$ is the partial regression coefficient of the phenotype y on the k-th marker conditional on all other markers. $b_k$ can also be denoted as $b_{yk \cdot s_k}$, where $s_k$ denotes the set that includes all markers except the k-th marker.

Since $x_{jk}$ takes a value of 1 or 0 with equal probability, the variance of the k-th marker in the population is $\sigma_k^2 = 1/4$. It is easy to show that the covariance between the i-th and k-th markers is $\sigma_{ik} = (1 - 2r_{ik})/4$, where $r_{ik}$ is the recombination frequency between marker i and marker k. The covariance between the trait value y and the k-th marker is:

$$ \sigma_{yk} = \sum_{u=1}^{m} \frac{(1 - 2r_{uk})\,\delta_u}{4} $$

where $\delta_u$ is the effect of the u-th QTL. With these basic equations, any conditional variance and covariance can be derived. The variance of marker k conditional on marker i is:

$$ \sigma_{k \cdot i}^2 = \sigma_k^2 - \sigma_{ik}^2 / \sigma_i^2 = \frac{1 - (1 - 2r_{ik})^2}{4} = r_{ik}(1 - r_{ik}) $$

Because, without interference, we have

$$ (1 - 2r_{ik}) = (1 - 2r_{il})(1 - 2r_{kl}) \quad \text{for order } ilk \text{ or } kli, $$

the covariance between markers i and k conditional on marker l is:

$$ \sigma_{ik \cdot l} = \sigma_{ik} - \sigma_{il}\sigma_{kl} / \sigma_l^2 = \begin{cases} 0 & \text{for order } ilk \text{ or } kli \\ r_{kl}(1 - r_{kl})(1 - 2r_{ik}) & \text{for order } ikl \text{ or } lki \\ r_{il}(1 - r_{il})(1 - 2r_{ik}) & \text{for order } lik \text{ or } kil \end{cases} $$

The above result shows that, conditional on an intermediate marker, the covariance between two flanking markers is expected to be zero. From this property Zeng (1993) shows:

$$ b_{yk \cdot s_k} = \sum_{k-1 < u \le k} \frac{r_{(k-1)u}\,(1 - r_{(k-1)u})\,(1 - 2r_{uk})}{r_{(k-1)k}\,(1 - r_{(k-1)k})}\, a_u \;+\; \sum_{k < u \le k+1} \frac{(1 - 2r_{ku})\, r_{u(k+1)}\,(1 - r_{u(k+1)})}{r_{k(k+1)}\,(1 - r_{k(k+1)})}\, a_u $$

where $a_u$ is the effect of the u-th QTL, the first summation is over all QTLs located between markers k-1 and k, and the second summation is over all QTLs located between markers k and k+1. This is a very desirable property: the regression coefficient depends only on those QTLs that are located between markers k-1 and k+1. It is this property that can be used to create an interval test, in which we test whether there are QTLs within a marker interval.
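The conditional covariance results above can be verified numerically. The sketch below is only a check of the algebra under the no-interference assumption, in which recombination frequencies of adjacent intervals combine as $r_{ac} = r_{ab} + r_{bc} - 2r_{ab}r_{bc}$; the function names are illustrative.

```python
def marker_cov(r):
    """Covariance between two 0/1-coded backcross markers with recombination frequency r."""
    return (1.0 - 2.0 * r) / 4.0


def combine(r_ab, r_bc):
    """Recombination frequency across two adjacent intervals without interference."""
    return r_ab + r_bc - 2.0 * r_ab * r_bc


def conditional_cov(r_ik, r_il, r_kl):
    """Covariance between markers i and k conditional on marker l."""
    marker_var = 0.25
    return marker_cov(r_ik) - marker_cov(r_il) * marker_cov(r_kl) / marker_var


if __name__ == "__main__":
    # Order i - l - k: the conditioning marker lies between i and k.
    r_il, r_lk = 0.10, 0.20
    print(conditional_cov(combine(r_il, r_lk), r_il, r_lk))      # approximately 0

    # Order i - k - l: the conditioning marker lies outside the interval.
    r_ik, r_kl = 0.10, 0.20
    print(conditional_cov(r_ik, combine(r_ik, r_kl), r_kl))      # approximately 0.128
    print(r_kl * (1.0 - r_kl) * (1.0 - 2.0 * r_ik))              # closed form, same value
```

The first case illustrates why fitting an intermediate marker blocks the association between markers on either side of it, which is the basis of the interval test.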

There are also other properties of multiple regression that have direct relevance to QTL mapping. These are summarized as follows:

- Conditioning on unlinked markers in the multiple regression analysis will reduce the sampling variance of the test statistic by controlling some residual genetic variation, and thus will increase the power of QTL mapping.
- Conditioning on linked markers in the multiple regression analysis will reduce the chance of interference of possible multiple linked QTL on hypothesis testing and parameter estimation, but with a possible increase of sampling variance.
- Two sample partial regression coefficients of the trait value on two markers in a multiple regression analysis are generally uncorrelated unless the two markers are adjacent.

2. Genetic Model

Composite interval mapping is an extension of interval mapping in which some selected markers are also fitted in the model as cofactors to control the genetic variation of other possibly linked or unlinked QTL. To test for a QTL on an interval between adjacent markers $M_i$ and $M_{i+1}$, the model is:

$$y_j = \mu + b^* x_j^* + \sum_k b_k x_{jk} + e_j \qquad (2\text{-}8)$$

where $x_j^*$ refers to the putative QTL and $x_{jk}$ refers to the markers selected for genetic background control. Appropriate selection of markers as cofactors is important and will be discussed later.

3. Likelihood Analysis

The likelihood function of formula (2-8) is specified as:

$$L(b^*, B, \sigma^2) = \prod_{j=1}^{n} \left[ p_{1j}\, \phi\!\left( \frac{y_j - X_j B - b^*}{\sigma} \right) + p_{0j}\, \phi\!\left( \frac{y_j - X_j B}{\sigma} \right) \right]$$

where $X_j B = \mu + \sum_k b_k x_{jk}$, and the maximum likelihood estimates of the various parameters are given below (using the EM algorithm):

$$P_j = \frac{p_{1j}\, \phi\!\left[ (y_j - X_j B - b^*)/\sigma \right]}{p_{1j}\, \phi\!\left[ (y_j - X_j B - b^*)/\sigma \right] + p_{0j}\, \phi\!\left[ (y_j - X_j B)/\sigma \right]}$$

$$\hat{b}^* = \frac{\sum_{j=1}^{n} (y_j - X_j \hat{B})\, P_j}{\sum_{j=1}^{n} P_j} = \frac{(Y - X\hat{B})' P}{c}$$

$$\hat{B} = (X'X)^{-1} X' (Y - P \hat{b}^*)$$

$$\hat{\sigma}^2 = \frac{(Y - X\hat{B})'(Y - X\hat{B}) - c\, \hat{b}^{*2}}{n}$$

where $c = \sum_{j=1}^{n} P_j$, $Y = \{y_j\}_{n \times 1}$, $P = \{P_j\}_{n \times 1}$, and the prime denotes matrix transposition.

4. Hypothesis Test

The hypotheses to be tested are $H_0: b^* = 0$ and $H_1: b^* \ne 0$. The likelihood function under the null hypothesis is:

$$L(b^* = 0, B, \sigma^2) = \prod_{j=1}^{n} \phi\!\left( \frac{y_j - X_j B}{\sigma} \right)$$

The maximum likelihood estimates of $B$ and $\sigma^2$ under the null hypothesis (marked with a double hat) are:

$$\hat{\hat{B}} = (X'X)^{-1} X'Y \quad \text{and} \quad \hat{\hat{\sigma}}^2 = \frac{(Y - X\hat{\hat{B}})'(Y - X\hat{\hat{B}})}{n}$$

The likelihood ratio (LR) test statistic is:

$$LR = -2 \ln \frac{L(b^* = 0, \hat{\hat{B}}, \hat{\hat{\sigma}}^2)}{L(\hat{b}^*, \hat{B}, \hat{\sigma}^2)}$$

As in the interval mapping method, the test can be performed at any position in a genome covered by markers, so it is easy to perform a systematic search for QTLs in a genome. Because the test statistic is almost independent for each interval, a test on each interval is more likely to test for a single QTL only.

5. Marker Selection

The main difficulty in using the composite interval mapping method is to decide which markers should be added to the model before searching for QTLs. There is no simple solution to this question because the answer depends on the

number and positions of the underlying QTLs, and this information is not available before QTL mapping. If too few markers are selected, most of the residual genetic variation may not be removed; if too many are selected, the power of the analysis may be reduced.

The practical implementation of marker selection in the QTL Cartographer software has two steps. In the first step, a selection procedure such as forward, backward, or stepwise regression selects $n_p$ markers that are significantly associated with the trait. In the second step, a testing window is defined, and markers inside the window are blocked from being used in the test model. The window is constructed with a parameter $W_s$, which is the distance (cM) between the testing interval (one for each direction) and the nearest marker picked for the model. Those of the $n_p$ selected markers that lie outside the testing window are then fitted in the model to reduce the residual variance.

Different conditions of composite interval mapping can be created by changing the values of $n_p$ and $W_s$. Generally, $n_p$ should be much smaller than $n$, not exceeding $2\sqrt{n}$ (Jansen 1994); alternatively, it can be determined automatically by the F-to-enter or F-to-drop criterion in the forward or backward regression analysis. $W_s$ should be at least 10 or 15 cM, depending on sample size.

2-4 Mixed Linear Model Approach

As introduced above, the CIM method is based on fixed multiple regression models. Zhu (1998) suggested a new methodology for mapping QTL by using a mixed linear model approach, called mixed-model-based composite interval mapping (MCIM). Unlike the CIM method, the MCIM method treats the marker effects as random effects. The obvious advantage of doing so is that the model can be extended easily to more complicated QTL mapping situations, such as QTL by environment interaction and QTL epistasis.
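For reference, the EM iteration and likelihood ratio test described for composite interval mapping can be sketched at a single test position as follows. This is a minimal illustration under a strong simplifying assumption: no cofactor markers are fitted, so $X_j B$ reduces to the mean $\mu$ (which is the interval mapping special case). The function name `em_lr_test`, the argument `p1` (the probability $p_{1j}$ for each individual), and the data are hypothetical. The residual variance is updated with the posterior-weighted sum of squares, which coincides with the closed form given earlier once the iteration has converged.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def em_lr_test(y, p1, n_iter=200):
    """EM estimates and LR statistic for y_j = mu + b* x*_j + e_j.

    y  : trait values
    p1 : probability, given the flanking markers, that x*_j = 1
    Assumption: no cofactors, so X_j B is just mu.
    """
    n = len(y)
    mu0 = sum(y) / n                              # estimates under H0: b* = 0
    var0 = sum((v - mu0) ** 2 for v in y) / n
    mu, var, b = mu0, var0, 0.0
    for _ in range(n_iter):
        sd = math.sqrt(var)
        # E-step: posterior probability P_j that individual j carries x* = 1
        post = []
        for yj, pj in zip(y, p1):
            w1 = pj * normal_pdf((yj - mu - b) / sd)
            w0 = (1.0 - pj) * normal_pdf((yj - mu) / sd)
            post.append(w1 / (w1 + w0))
        c = sum(post)
        # M-step: update b*, then mu, then the residual variance
        b = sum((yj - mu) * Pj for yj, Pj in zip(y, post)) / c
        mu = sum(yj - Pj * b for yj, Pj in zip(y, post)) / n
        var = sum(Pj * (yj - mu - b) ** 2 + (1.0 - Pj) * (yj - mu) ** 2
                  for yj, Pj in zip(y, post)) / n
    sd, sd0 = math.sqrt(var), math.sqrt(var0)
    loglik1 = sum(math.log((pj * normal_pdf((yj - mu - b) / sd)
                            + (1.0 - pj) * normal_pdf((yj - mu) / sd)) / sd)
                  for yj, pj in zip(y, p1))
    loglik0 = sum(math.log(normal_pdf((yj - mu0) / sd0) / sd0) for yj in y)
    return b, mu, var, 2.0 * (loglik1 - loglik0)   # LR = -2 ln(L0 / L1)


# Hypothetical data: two well-separated genotype classes of four individuals.
y = [10.0, 10.2, 9.8, 10.1, 12.0, 12.1, 11.9, 12.2]
p1 = [0.05] * 4 + [0.95] * 4
b_hat, mu_hat, var_hat, lr = em_lr_test(y, p1)
print(f"b* = {b_hat:.3f}, mu = {mu_hat:.3f}, sigma^2 = {var_hat:.4f}, LR = {lr:.2f}")
```

Adding cofactors would replace the update of `mu` by a least-squares fit of $Y - P\hat{b}^*$ on the cofactor design matrix, as in the estimate of $\hat{B}$ above.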

1. Genetic Model

For a $B_1$ or DH population, to analyze a QTL located on an interval flanked by markers $M_i$ and $M_{i+}$, the MCIM method assumes the following model:

$$y_j = \mu + a\, x_{A(j)} + \sum_k u_{M(kj)}\, e_{M(k)} + \varepsilon_j$$

where $y_j$ is the trait value for individual $j$, $\mu$ is the population mean, $a$ is the additive effect of the putative QTL and $x_{A(j)}$ is the coefficient for the additive effect, $e_{M(k)}$ is the random effect of marker $k$ with its coefficient $u_{M(kj)}$, and $\varepsilon_j$ is the random residual effect. The model can also be expressed in mixed linear model form as follows:

$$y = Xb + U_M e_M + e_\varepsilon = Xb + \sum_{u=1}^{2} U_u e_u \sim N(Xb, V)$$

$$V = \sigma_M^2\, U_M R_M U_M' + \sigma_\varepsilon^2\, I = \sum_{u=1}^{2} \sigma_u^2\, U_u R_u U_u' \qquad (2\text{-}9)$$

where $V$ is the variance of the model, $\sigma_1^2 = \sigma_M^2$ is the variance component of the markers, and $\sigma_2^2 = \sigma_\varepsilon^2$ is the residual variance component. $R_1 = R_M$ is a known symmetric matrix of correlation coefficients, $R_M = [\rho_{ff'}]$ ($f, f' = 1, \ldots, m$), and $R_2 = I$ is the identity matrix. In the above formula, $m$ is the number of markers selected for background control and $\rho_{ff'} = 1 - 2r_{ff'}$ is the correlation coefficient between marker effects $e_{M_f}$ and $e_{M_{f'}}$, where $r_{ff'}$ is the recombination frequency between marker loci $f$ and $f'$.

2. Likelihood Analysis

The log-likelihood function for formula (2-9) is specified as:

$$\log L(b, V) = l(b, V) = -\frac{n}{2} \ln(2\pi) - \frac{1}{2} \ln |V| - \frac{1}{2} (y - Xb)' V^{-1} (y - Xb) \qquad (2\text{-}10)$$

where the variance $V$ of the model can be calculated according to formula (2-9), and the variance components $\sigma_u^2$ can be estimated by MINQUE-1 (Rao 1971; Rao 1997) or

REML method (Hartley and Rao 1967; Searle 1970). The estimates of the QTL effects $b$ are obtained by the formula:

$$\hat{b} = (X' \hat{V}^{-1} X)^{-1} X' \hat{V}^{-1} y \qquad (2\text{-}11)$$

3. Hypothesis Test

As in the IM and CIM methods, a putative QTL is searched for between each pair of flanking markers $M_i$ and $M_{i+}$ along the whole genome by setting a prior value for the recombination frequency $\hat{r}_{M_i Q}$ between marker $M_i$ and the putative QTL locus $Q$. The likelihood ratio statistic (LR) can be calculated by:

$$LR = 2\, l_1(\hat{b}, \hat{V}, \hat{r}_{M_i Q}) - 2\, l_0(\hat{b}, \hat{V}, r_{M_i Q} = 0.5) \qquad (2\text{-}12)$$

The LR profile for the whole genome can therefore be plotted, and the QTLs can be located according to the LR profile.

4. A Model for GE Interaction

If a QTL mapping experiment is conducted in several environments with individuals sampled from the same DH population, QTL genetic main effects and GE interaction effects can be evaluated by the MCIM method. When experimental data obtained from different environments need to be analyzed, environment effects are usually treated as random effects. The additive model (2-9) can be expanded to include environment and replication effects together with the interactions of the additive and marker effects with environments. The trait value measured on the $j$th individual in the $h$th environment and $k$th replication can be expressed as:

$$y_{hjk} = \mu + a\, x_{A(j)} + u_{E(hj)}\, e_{E(h)} + u_{AE(hj)}\, e_{AE(h)} + \sum_f u_{M(fj)}\, e_{M(f)} + \sum_l u_{ME(lhj)}\, e_{ME(lh)} + u_{B(k)}\, e_{B(k)} + e_{hjk}$$

where $\mu$ is the population mean, $a$ is the additive main effect of the QTL being searched for and $x_{A(j)}$ is the coefficient for the genetic main additive effect, $e_{E(h)}$ is the environment effect with its coefficient $u_{E(hj)}$, $e_{AE(h)}$ is the additive by environment interaction effect with its

coefficient $u_{AE(hj)}$, $e_{M(f)}$ is the random main effect over environments for the $f$th marker genotype with its coefficient $u_{M(fj)}$, $e_{ME(lh)}$ is the marker by environment interaction effect with its coefficient $u_{ME(lhj)}$, $e_{B(k)}$ is the replication effect with its coefficient $u_{B(k)}$, and $e_{hjk}$ is the random residual effect.

The model can also be expressed in mixed linear model form:

$$y = Xb + U_E e_E + U_{AE} e_{AE} + U_M e_M + U_{ME} e_{ME} + U_B e_B + e_\varepsilon = Xb + \sum_{u=1}^{6} U_u e_u \sim N(Xb, V)$$

$$V = \sigma_E^2\, U_E U_E' + \sigma_{AE}^2\, U_{AE} U_{AE}' + \sigma_M^2\, U_M R_M U_M' + \sigma_{ME}^2\, U_{ME} R_{ME} U_{ME}' + \sigma_B^2\, U_B U_B' + \sigma_\varepsilon^2\, I = \sum_{u=1}^{6} \sigma_u^2\, U_u R_u U_u' \qquad (2\text{-}13)$$

By using model (2-13) and formula (2-10), QTLs can be searched for (according to the LR values) by the mixed linear model approach, using the data for all individuals across multiple environments and replications. When a QTL is found, its position on the chromosome and its genetic main effects (formula 2-11), as well as its GE interaction effects, can be obtained by the mixed linear model approach. GE interaction effects can be predicted by the BLUP method (Zhu 1999):

$$\hat{e}_u = \sigma_u^2\, R_u U_u'\, Q\, y, \qquad Q = V^{-1} - V^{-1} X (X' V^{-1} X)^{-1} X' V^{-1}$$

5. A Model for QTL Epistasis

The following model can be used for a two-way search for QTLs with digenic epistatic effects when the population is $B_1$ or DH:

$$y_k = \mu + a_i x_{ik} + a_j x_{jk} + aa_{ij}\, w_{ijk} + \sum_{f=1}^{m} u_{M(fk)}\, e_{M(f)} + \sum_{h=1}^{mm} u_{MM(hk)}\, e_{MM(h)} + \varepsilon_k$$

where $y_k$ is the trait value of individual $k$ and $\mu$ is the mean of the population. $a_i$ and $a_j$ are the additive effects of the two putative QTLs $i$ and $j$ at the two testing points $p_i$ and $p_j$. $aa_{ij}$ is the digenic additive by additive epistatic effect between

the QTLs $i$ and $j$. $x_{ik}$, $x_{jk}$, and $w_{ijk}$ are the coefficients for the effects of QTL $i$, QTL $j$, and the QTL epistasis, respectively. $e_{M(f)}$ is the random effect of marker $f$, and $e_{MM(h)}$ is the random effect of the two-locus marker interaction between two markers; $u_{M(fk)}$ and $u_{MM(hk)}$ are their coefficients. $\varepsilon_k$ is the random residual effect.

The model can also be expressed in the following mixed linear model form:

$$y = Xb + U_M e_M + U_{MM} e_{MM} + e_\varepsilon = Xb + \sum_{u=1}^{3} U_u e_u \sim N(Xb, V)$$

$$V = \sigma_M^2\, U_M R_M U_M' + \sigma_{MM}^2\, U_{MM} R_{MM} U_{MM}' + \sigma_\varepsilon^2\, I = \sum_{u=1}^{3} \sigma_u^2\, U_u R_u U_u' \qquad (2\text{-}14)$$

where $R_{MM}$ is a known symmetric matrix of correlation coefficients for the marker interactions, $R_{MM} = [\rho_{hh'}]$ ($h, h' = 1, 2, \ldots, mm$), with

$$\rho_{hh'} = \rho_{ij\_i'j'} = \frac{(1 - 2r_{ab})(1 - 2r_{cd}) - (1 - 2r_{ij})(1 - 2r_{i'j'})}{4 \sqrt{r_{ij}(1 - r_{ij})\, r_{i'j'}(1 - r_{i'j'})}}, \qquad i < j,\ i' < j'$$

The set $(i, j, i', j')$ is equal to the set $(a, b, c, d)$, with $a < b < c < d$ in whole-genome order.

The basic idea of mapping QTLs with marginal and two-way epistatic effects is a two-dimensional search along the whole genome. For each pair of testing points $p_i$ and $p_j$, each within an interval flanked by two markers, the LR value can be calculated by using formula (2-12), and the QTLs can be located by analysing the LR profile.

2-5 Multiple Interval Mapping

Multiple interval mapping (MIM) is a multiple-QTL-oriented method that combines QTL mapping analysis with the analysis of the genetic architecture of quantitative traits, using a search algorithm to search for the number, positions, effects, and interactions of significant QTL simultaneously. For $m$ putative QTL, the multiple interval mapping model for a $B_1$ population is defined by:

$$y_i = \mu + \sum_{r=1}^{m} \alpha_r x_{ir}^* + \sum_{r<s}^{t} \beta_{rs}\, (x_{ir}^* x_{is}^*) + e_i \qquad (2\text{-}15)$$

where $y_i$ is the trait value of individual $i$ and $\mu$ is the mean of the model. $\alpha_r$ is the additive (marginal) effect of putative QTL $r$ and $x_{ir}^*$ is its coefficient, which is unobserved but can be inferred from the marker data in a probabilistic sense. $\beta_{rs}$ is the epistatic effect between putative QTL $r$ and $s$, $t$ is the number of significant pairwise epistatic effects, and $e_i$ is the random residual effect. The likelihood function of the data given the model (2-15) is a mixture of normal distributions:

$$L(E, \mu, \sigma^2) = \prod_{i=1}^{n} \left[ \sum_{j=1}^{2^m} p_{ij}\, \phi(y_i \mid \mu + D_j E, \sigma^2) \right]$$

where $p_{ij}$ is the probability of each multilocus genotype conditional on the marker data, $E$ is a vector of QTL parameters (the $\alpha$'s and $\beta$'s), $D_j$ is a vector specifying the configuration of the $x^*$'s associated with each $\alpha$ and $\beta$ for the $j$th QTL genotype, $\phi(y \mid \mu, \sigma^2)$ denotes a normal density function for $y$ with mean $\mu$ and variance $\sigma^2$, and $n$ is the number of individuals.

The MIM method consists of the following four components:

- An evaluation procedure designed to analyse the likelihood of the data given a genetic model (number, positions, and epistasis of QTL) (Kao and Zeng 1997).
- A search strategy optimised to select the best (or a better) genetic model in the parameter space.
- An estimation procedure for all parameters of the genetic architecture of the quantitative traits simultaneously, given the selected genetic model.
- A prediction procedure to estimate or predict the genotypic values of individuals, based on the selected genetic model and estimated genetic parameter values, for marker-assisted selection.
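The evaluation component, the mixture likelihood of model (2-15), can be sketched as follows. This is an illustration only: the genotype probabilities $p_{ij}$ are taken as given rather than computed from marker data, the $x^*$ values are assumed to be coded $+1/2$ and $-1/2$ (the coding used for the $B_1$ simulation model in the next chapter), and all function and argument names are hypothetical.

```python
import math
from itertools import product


def normal_density(y, mean, var):
    """Normal density of y with the given mean and variance."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def genotype_value(x, mu, alpha, beta):
    """mu + D_j E for one multilocus genotype x (a tuple of +1/2 or -1/2 codes).

    alpha : list of marginal effects, one per putative QTL
    beta  : dict mapping index pairs (r, s), r < s, to epistatic effects
    """
    value = mu + sum(a * xr for a, xr in zip(alpha, x))
    value += sum(b * x[r] * x[s] for (r, s), b in beta.items())
    return value


def mim_loglik(y, geno_probs, mu, alpha, beta, var):
    """Log of the mixture likelihood over the 2^m multilocus genotypes.

    geno_probs[i] maps a genotype tuple to p_ij for individual i;
    genotypes missing from the dict are treated as having probability 0.
    """
    genotypes = list(product((0.5, -0.5), repeat=len(alpha)))
    total = 0.0
    for yi, probs in zip(y, geno_probs):
        mixture = sum(probs.get(g, 0.0)
                      * normal_density(yi, genotype_value(g, mu, alpha, beta), var)
                      for g in genotypes)
        total += math.log(mixture)
    return total


# Hypothetical example: two putative QTLs with one epistatic term.
alpha = [2.0, 4.0]
beta = {(0, 1): 8.0}
probs = [{(0.5, 0.5): 0.9, (0.5, -0.5): 0.1},
         {(-0.5, -0.5): 0.8, (-0.5, 0.5): 0.2}]
print(mim_loglik([15.2, 9.1], probs, 10.0, alpha, beta, 1.0))
```

A model search, as in the second component, would call such an evaluation repeatedly while adding or dropping QTL and epistatic terms.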

Among these components, the second is the critical part of the MIM method. In the next chapter, simulation studies are conducted on the selection criteria used in the model selection framework.

3. Simulation Studies

3-1 Simulation Model and Data

In this section, the model and method for producing simulation data for QTL mapping experiments are discussed. The simulation data consist of two parts: mapping information and QTL information.

1. Genetic Model for Simulation

The following is a general genetic model for a $B_1$ or DH population with $m$ QTLs:

$$y_i = \mu + \sum_{r=1}^{m} \alpha_r x_{ir} + \sum_{r<s}^{1 \ldots m} \beta_{rs}\, (x_{ir} x_{is}) + e_i \qquad (3\text{-}1)$$

where $y_i$ is the trait value of individual $i$, and $i$ indexes the individuals in the population ($i = 1, 2, \ldots, n$). $\mu$ is the mean of the model. $\alpha_r$ is the marginal effect of QTL $r$, and $x_{ir}$ is an indicator variable denoting the genotype of QTL $r$; $x_{ir}$ takes the values ½ and −½ for a $B_1$ population and 1 and −1 for a DH population. $\beta_{rs}$ is the epistatic effect between QTL $r$ and QTL $s$, and $m$ is the number of QTLs chosen for the simulation. $e_i$ is the residual effect of the model, assumed to be normally distributed with mean zero and variance $\sigma^2 = V_e$.

The variance of model (3-1) can be partitioned into several components: additive variance, epistatic variance, and residual variance. Formulas (3-2A) and (3-3A) give the additive and epistatic variances for a $B_1$ population, and formulas (3-2B) and (3-3B) give the additive and epistatic variances for a DH population.

$$Var(y) = Var(G) + Var(E) = E(G^2) - [E(G)]^2 + V_e = V_A + V_I + V_e$$

48 1 1 V A = α i + α iα j (1 rij ) (3-A) 4 i i< j i< j V = α + α α (1 r ) (3-B) V V A I I i i 1 16 i i j ij β ij β kl (1 rij )(1 rkl ) β ij (1 rij ) (3-3A) i j = 1 4 i < j, k< l < β ij β kl (1 rij )(1 rkl ) β ij (1 rij ) (3-3B) i j = < j, k< l < where r ij is the recombination frequency between QTL i and QTL j.. Parameter Setting The first step of producing simulation data is to set the mapping parameters, such as experimental population (B 1 or DH), sample size (n), trait mean (µ), map function (Haldane or Kosambi), and marker genotypes (for example, 1 for one genotype and 0 for another genotype). Especially, it is important to define chromosome information such as chromosome number, marker number and positions for each chromosome. Table 3-1 shows an example of parameters setting for QTL mapping information. Table 3-1. An example of parameters setting for simulation mapping information. Sample Trait Map Marker genotype Population Size Mean Function Chromosomes Mm MM B Haldane The second step is to set the parameters of QTLs such as heritability (h ), the ratio of epistatic variance by additive variance C, which is defined as V I / V A (see formula 3-A and 3-3A), QTL number, positions, and effects. One example of the parameters setting is showed in Table 3-. By using this information, it is easy to produce the additive (α) epistatic (β) upper-triangle matrix as showed in Table 3-3. The QTL effects can be adjusted according to h, C, and V e as following. Table 3-. An example of parameters setting for QTL information. Additive Effect Epistatic Effect QTL Number Heritability C = V I / V A 1 Sign : Both (1:3) Sign : Same Distribution : γ-.1 Distribution : γ Effects can be same direction or both directions, in which case, a ratio can be indicated. Effects 4

can be chosen from different distributions, such as gamma (with one parameter), normal, or even.

Assume the heritability is $h^2$ and $C = V_I / V_A$; then

$$V_e = \frac{1 - h^2}{h^2}\, V_G$$

Note: formulas (3-2A), (3-2B) and (3-3A), (3-3B) can be used to calculate $V_I$ and $V_A$. After the values of $\alpha_i$ and $\beta_{ij}$ have been set, the values of $\beta_{ij}$ should be adjusted according to the value of $C$. If

$$R = \sqrt{\frac{V_I}{C\, V_A}} \ne 1$$

then $\beta_{ij}$ is replaced by $\beta_{ij} / R$ to ensure that $R = 1$ and $C = V_I / V_A$. Finally, the QTL effects are standardized by replacing $\alpha$ and $\beta$ with $\alpha / \sqrt{V_e}$ and $\beta / \sqrt{V_e}$, so that the value of $V_e$ is equal to 1.

Table 3-3. An example of simulation parameter settings for the positions and effects of QTLs. Here, $V_A$ = 1.364, $V_I$ = 0.136, $V_e$ = 1.0, $C = V_I / V_A$ = 0.10, h =
Chromosome Positions (cM) QTLs

3. Simulation Procedure

The marker genotype data and the trait value for each individual can be produced according to the mapping information and the QTL information. The basic simulation strategy is to walk along the chromosomes and treat the marker positions and QTL positions alike. The difference between a marker and a QTL is that when a marker is reached, the marker genotype (0 or 1) is simply recorded, whereas for a QTL the additive and epistatic effects are added to the trait value of the current individual.

For each individual, the simulation starts from the first marker of each chromosome. The first marker genotype is set to 0 or 1 with a 50% chance each and is recorded. At the next marker or QTL position, the chance of obtaining a given genotype is determined by the recombination frequency between the previous position and the current

position. For example, if the distance between these two positions is 10 cM and the Haldane map function is used, then according to formula (2-1) the recombination frequency is 0.091. Therefore, the current genotype will differ from the previous one only with a chance of 9.1%. After the genotype for the current position has been decided, we either record the genotype value or add the QTL additive effect to the trait value. The procedure continues until all markers and QTLs have been reached. Then the QTL epistatic effects can be added. After the trait mean and the random residual effect have been added, the trait value for the current individual is obtained.

4. Format of the Simulation Data

Table 3-4 shows an example of QTL mapping simulation data. The first part of the data is the marker genotypes, that is, the records for every marker position of the whole genome. For inbred line crosses, there are 3 possible marker genotypes for an intercross population and 2 for backcross, DH, and RIL populations. Different numbers are usually used to represent the different marker genotypes; for example, 2 can denote genotype AA, 1 can denote Aa, and 0 can denote aa. The second part is the trait value, which is the joint effect of several factors. These factors include the trait mean value, the heritability, and the QTL positions and effects (additive, dominance for an intercross population, and possible epistatic effects).

To analyse the simulation data, other information besides the marker data and trait values is needed as well. This includes the map information, such as the map function, marker positions, and population type.

Table 3-4. An example of the simulation data with 5 individuals.
Individuals Marker Data Trait Value
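The effect scaling and the chromosome walk described in this section can be sketched as follows. This is a simplified illustration under stated assumptions: a single chromosome of a $B_1$ population, Haldane's map function, additive effects only (epistatic effects omitted), marker genotype 1 mapped to $x = +1/2$ and genotype 0 to $x = -1/2$, and the first locus on the chromosome drawn with equal probability whether it is a marker or a QTL. The function names, the tuple layout of `loci`, and the input values are hypothetical; `standardise` expects $V_A$ and $V_I$ already computed and requires $V_I > 0$.

```python
import math
import random


def haldane_r(distance_cM):
    """Recombination frequency for a map distance (cM) under Haldane's map function."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cM / 100.0))


def standardise(v_a, v_i, h2, c):
    """Scale variances so that V_I / V_A = C and V_e = 1.

    Returns (R, V_e before standardisation, standardised V_A, standardised V_I).
    Epistatic effects are divided by R; all effects are then divided by sqrt(V_e).
    """
    r = math.sqrt(v_i / (c * v_a))
    v_i_adj = v_i / (r * r)                  # V_I after dividing the betas by R
    v_e = (1.0 - h2) / h2 * (v_a + v_i_adj)  # V_e = (1 - h^2) / h^2 * V_G
    return r, v_e, v_a / v_e, v_i_adj / v_e


def simulate_individual(loci, mu, resid_sd, rng):
    """Walk along one chromosome; loci are (position_cM, kind, effect), sorted by position.

    kind is "M" for a marker (genotype recorded) or "Q" for a QTL (effect added).
    """
    genotype = rng.randint(0, 1)             # first locus: 0 or 1 with 50% chance
    markers, trait, previous = [], mu, None
    for position, kind, effect in loci:
        if previous is not None and rng.random() < haldane_r(position - previous):
            genotype = 1 - genotype          # a recombination switches the genotype
        previous = position
        if kind == "M":
            markers.append(genotype)
        else:
            trait += effect * (genotype - 0.5)
    trait += rng.gauss(0.0, resid_sd)        # random residual effect
    return markers, trait


print(f"r at 10 cM: {haldane_r(10.0):.3f}")
print(standardise(2.0, 1.0, 0.6, 0.10))
rng = random.Random(2000)
chromosome = [(0.0, "M", 0.0), (5.0, "Q", 1.5), (10.0, "M", 0.0), (20.0, "M", 0.0)]
print(simulate_individual(chromosome, 15.6, 1.0, rng))
```

With $C = 0.10$ and a heritability of 0.6, the value used in the simulation designs of this chapter, the standardised variances come out at about 1.364 and 0.136 for any starting $V_A$ and positive $V_I$, which agrees with the values quoted in the caption of Table 3-3.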

3-2 Single Marker Analysis

In this section, a simulation study is conducted for single marker analysis. The simulation design is based on: replications of 500, sample size of 00, $B_1$ population and trait mean of 15.6, Haldane mapping function, total chromosome number of 3 with marker number of 1, 11, and 15 respectively, and average marker distance of 10 cM with positions having a certain deviation (see Table 3-5). We set 5 QTLs in total for the whole genome, and the heritability is 0.6. Of these QTLs, one is placed on chromosome 1 and one on chromosome 2. The other 3 QTLs are linked and placed on chromosome 3.

In Table 3-5, the t statistic (t-val) for each marker is calculated by formula (2-3) and the LR value is obtained according to formula (2-4). The analysis also fits the data to the simple linear regression model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, and the estimates of $\beta_0$ and $\beta_1$ for each marker are reported as well. The t statistic tests the hypothesis that the marker is unlinked to the quantitative trait. The column headed Pr(t-Val) gives the corresponding p-value, that is, the probability of a t statistic at least this extreme if the marker were unlinked to the trait. Significance at the 5%, 1%, 0.1%, and 0.01% levels is indicated by *, **, ***, and ****, respectively.

For the QTL with median effect (0.754) on chromosome 1, the QTL position indicated by the significance levels is reasonably accurate. However, for the QTL with large effect (-1.331) on chromosome 2, the range indicated for the QTL position is much wider (marker 6 to marker 10). In the multiple-linked-QTL situation on chromosome 3, there is no way to distinguish the three QTLs, because almost all markers have very high significance levels. None of the QTL effects can be estimated by the single marker method, because the QTL positions and effects are confounded. From this simulation study, it is clear that the single marker method has the power to detect markers associated with existing QTLs.
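The per-marker regression can be sketched as follows. Formulas (2-3) and (2-4) are not reproduced in this section, so the sketch assumes the usual least-squares t statistic for $\beta_1$ and the normal-theory likelihood ratio $n \ln(SS_{total} / SS_{residual})$; these forms, the function name, and the data are illustrative assumptions.

```python
import math


def single_marker_test(x, y):
    """Regress trait y on a 0/1 marker indicator x; return (b0, b1, t, LR)."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    b1 = sxy / sxx                      # slope: difference between genotype means
    b0 = mean_y - b1 * mean_x           # intercept: mean of genotype 0
    rss = syy - b1 * sxy                # residual sum of squares
    se_b1 = math.sqrt(rss / (n - 2) / sxx)
    t = b1 / se_b1
    lr = n * math.log(syy / rss)        # equals n * ln(1 + t^2 / (n - 2))
    return b0, b1, t, lr


# Hypothetical data for one marker in six individuals.
marker = [0, 0, 1, 1, 0, 1]
trait = [1.0, 1.2, 2.1, 1.9, 0.8, 2.0]
b0, b1, t, lr = single_marker_test(marker, trait)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, t = {t:.3f}, LR = {lr:.3f}")
```

Repeating the test for every marker gives one row of statistics per marker, which is the layout of Table 3-5. The slope estimates the QTL effect only after attenuation by the recombination between marker and QTL, which is why position and effect are confounded in this method.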

Table 3-5. Simulation result of single marker analysis (average of 500 replications).
Columns: Chr, QPos, Effect, Mk, MPos, t-val, LR, β0, β1, Pr(t-Val).
Chr = chromosome. QPos = QTL position in cM. Effect = QTL effect. Mk = marker number. MPos = marker position in cM.

3-3 Comparing Different Mapping Methods

It is helpful to know the advantages and disadvantages of the QTL mapping methods before choosing one for a particular QTL mapping experiment. In this section, using DH as the model population, simulation studies are conducted to compare the performance of three methods (IM, CIM, and MCIM) under the simple additive model. Results are presented on the estimated positions and effects of the QTLs, the detection power, and the probability of detecting false QTLs.

1. Parameters Setting

In this study, the simulation design is based on: replications of 500, sample size of 00, population mean of 15.6, Haldane mapping function, total chromosome number of 9, marker number of 11 for each chromosome, and average marker distance of 10 cM with positions having a certain deviation (Figure 3-1). We set 7 QTLs in total for the whole genome, and the heritability is 0.6. Among these QTLs, there are 2 QTLs with large effects, 3 QTLs with median effects, and 2 QTLs with small effects. One QTL with median effect and one QTL with small effect have the opposite sign to the other QTLs. According to the number of QTLs on one chromosome, we constructed two different QTL models: Model-I has only one QTL per chromosome and Model-II has multiple QTLs on one chromosome.

2. Estimation of QTL Effects

The estimates of QTL effects for the one-QTL Model-I and the multiple-QTL Model-II obtained by the IM, CIM, and MCIM methods are shown in Table 3-6. The estimates were obtained by averaging the effects at each known QTL position over the 500 replications. For Model-I, the estimated QTL effects are very close to the parameter values. The results imply that the estimation of QTL effects is unbiased for all three QTL mapping methods (IM, CIM, and MCIM) under the one-QTL model. Unlike in Model-I, the estimation of QTL effects has a small bias in the multiple-QTL Model-II because of the linkage between QTLs.
For the two QTLs with large effects (Q2-1L and Q2-2L), the effects have been apparently overestimated by

IM method. The QTL effects have been underestimated by all three mapping methods for the two QTLs with small effects (Q3-1S and Q3-2S). The situation is mixed for the three median-effect QTLs (Q1-1M, Q1-2M, and Q1-3M), with some QTL effects underestimated and some overestimated. For the QTL Q1-3M in particular, the bias is quite serious for the IM method. However, the estimation bias of QTL effects is quite small for QTLs with large and median effects, especially with the CIM and MCIM methods.

Table 3-6. The simulation results of the QTL effect on the QTL positions for Model-I and Model-II.
Columns (for each of MODEL-I and MODEL-II): QTLs, Eff, IM, CIM, MCIM.
Rows (MODEL-I): Q1-1L, Q2-1M, Q3-1S, Q4-1L, Q5-1M, Q7-1M, Q9-1S.
Rows (MODEL-II): Q1-1M, Q1-2M, Q1-3M, Q2-1L, Q2-2L, Q3-1S, Q3-2S.
QTL = QTL with chromosomal number and serial number followed by effect (L-large, M-median, or S-small). Eff = effect of QTLs.

3. Power and False Positive

Simulation results are presented for the power of QTL detection and the probability of identifying false QTLs under different thresholds by the IM, CIM, and MCIM methods for Model-I and Model-II (Table 3-7 and Table 3-8). This information was obtained by analysing the LR peaks of the LR profile for each chromosome. A detected QTL is defined as a valid LR peak whose highest LR value is greater than a predefined threshold. If a detected QTL matches a predefined QTL, it is counted towards the power of QTL detection. If the detected QTL cannot be matched with any predefined QTL on the same chromosome, it is counted as a false QTL. The predefined threshold value is clearly very important for mapping QTL: decreasing the threshold increases both the power of QTL detection and the probability of detecting false QTLs, and increasing it has the reverse effect.
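The counting rule for detected and false QTLs can be sketched as follows. The text does not spell out what makes an LR peak "valid" or how close a peak must be to a predefined QTL to count as a match, so the sketch makes two assumptions: a peak is any local maximum of the profile above the threshold, and a peak matches a simulated QTL when it lies within a tolerance `tolerance_cM` chosen by the user. The conversion LR = 2 ln(10) × LOD is the standard relation between the two scales. All names and data are hypothetical.

```python
import math


def lod_to_lr(lod):
    """Convert a LOD threshold to the LR scale: LR = 2 ln(10) * LOD."""
    return 2.0 * math.log(10.0) * lod


def find_peaks(positions, lr_values, threshold):
    """Return (position, LR) for local maxima of the LR profile above the threshold."""
    peaks = []
    for i, value in enumerate(lr_values):
        left = lr_values[i - 1] if i > 0 else float("-inf")
        right = lr_values[i + 1] if i + 1 < len(lr_values) else float("-inf")
        if value > threshold and value >= left and value > right:
            peaks.append((positions[i], value))
    return peaks


def classify_peaks(peaks, true_positions, tolerance_cM):
    """Split peaks into matched simulated QTLs and a count of false QTLs."""
    detected, false_count = set(), 0
    for position, _ in peaks:
        matches = [q for q in true_positions if abs(q - position) <= tolerance_cM]
        if matches:
            detected.add(min(matches, key=lambda q: abs(q - position)))
        else:
            false_count += 1
    return detected, false_count


# Hypothetical LR profile on one chromosome, scanned every 10 cM.
positions = [0, 10, 20, 30, 40, 50, 60]
profile = [1.0, 5.0, 15.0, 6.0, 2.0, 12.0, 3.0]
peaks = find_peaks(positions, profile, lod_to_lr(2.5))
print(peaks)
print(classify_peaks(peaks, true_positions=[22], tolerance_cM=10))
```

Over the replications, the power for a QTL would then be the fraction of replications in which it appears in the detected set, and the false QTL probabilities would come from the distribution of the false counts.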

Table 3-7. Power of QTL detection and the probability of false QTL detected under different thresholds for Model-I.
Columns: QTL; IM, CIM, MCIM under each of LOD = 2.0, LOD = 2.5, and LOD = 3.0.
Rows: Q1-1L, Q2-1M, Q3-1S, Q4-1L, Q5-1M, Q7-1M, Q9-1S, FQTL1, FQTL2, FQTL2+.
LOD = threshold. FQTL1 = the probability of detecting one false QTL in the whole genome. FQTL2 = the probability of detecting two false QTLs. FQTL2+ = the probability of detecting more than two false QTLs.

For Model-I (Table 3-7), the three mapping methods were equally efficient in detecting QTLs with large effects (Q1-1L and Q4-1L). However, QTLs with very small effects (Q3-1S and Q9-1S) could only be detected with very low efficiency. The power of detecting QTLs with median effects depends on the QTL mapping method and the threshold value. The CIM method tended to have the highest power among the three mapping methods, while the MCIM method was more efficient than the IM method. As for the probability of false QTL detection, the IM method in general gave more false QTLs under all three threshold values. The CIM and MCIM methods had a similar likelihood of finding one false QTL, but the CIM method was better than the MCIM method with respect to detecting two or more false QTLs.

For the multiple-QTL model (Model-II in Table 3-8), all three mapping methods have high efficiency in detecting QTLs with large effects on the same chromosome (Q2-1L and Q2-2L). If there are QTLs with very small effects (Q3-1S and Q3-2S) on one chromosome, they are almost undetectable by these three methods. The IM method cannot detect the QTL with negative median effect (Q1-3M), which is linked to QTLs with positive effects (Q1-1M and Q1-2M). The CIM method tended to be more efficient than the MCIM method, except for one QTL (Q1-2M) being closely linked to

another (Q1-1M) with the same direction of effect. The IM method gave a high probability of false QTL detection compared with the other two mapping methods. The CIM method tended to have a smaller likelihood of finding false QTLs.

Table 3-8. Power of QTL detection and the probability of false QTL detected under different thresholds for Model-II.
Columns: QTL; IM, CIM, MCIM under each of LOD = 2.0, LOD = 2.5, and LOD = 3.0.
Rows: Q1-1M, Q1-2M, Q1-3M, Q2-1L, Q2-2L, Q3-1S, Q3-2S, FQTL1, FQTL2, FQTL2+.

The density of the genetic markers affects both the power of QTL detection and the probability of detecting false QTLs, as shown in Table 3-9. When marker density increases, there is no apparent gain in power for detecting QTLs with large effects (Q2-1L and Q2-2L) by any of the three QTL mapping methods. However, the MCIM method tends to be more powerful than the other two methods (IM and CIM) for detecting QTLs with small effects. For the power of detecting linked QTLs with opposite effects (Q1-2M and Q1-3M), the MCIM method shows a great improvement, while the CIM method performs quite poorly. This suggests that increasing the marker density can sometimes even be harmful for the CIM method. The QTL Q1-3M still cannot be detected by the IM method when the marker density is increased.

The impact of the sample size on the power of QTL detection and the probability of detecting false QTLs is shown in Table 3-10. Basically, the power of QTL detection increases with the sample size for all three mapping methods. In particular, the CIM method shows a large improvement in both the power of QTL detection and the probability of detecting false QTLs after the sample size is increased.

Table 3-9. Power of QTL detection and the probability of false QTL detected under Model-II when chromosomes = 3, average marker distance = 4 cM, and threshold value is LOD = 2.5.
Columns: QTL, IM, CIM, MCIM.
Rows: Q1-1M, Q1-2M, Q1-3M, Q2-1L, Q2-2L, Q3-1S, Q3-2S, FQTL.
FQTL = probability of false QTL detected in the whole genome.

Table 3-10. Power of QTL detection and the probability of false QTL detected under Model-II for different sample sizes (threshold value with LOD = 2.5).
Columns: QTL; IM, CIM, MCIM under each of Samples = 100 and Samples = 300.
Rows: Q1-1M, Q1-2M, Q1-3M, Q2-1L, Q2-2L, Q3-1S, Q3-2S, FQTL.

The performance of the QTL mapping analysis is also affected by the adjustable factors of the method itself. Before the QTL mapping analysis, the CIM method requires parameters such as the window size and the number of control markers to be set. In this simulation study, we simply use the default parameters: 10 cM for the window size and 5 for the number of control markers. However, changing these parameters in the CIM method sometimes has a great influence on the power of QTL detection and the probability of false QTL detection, as shown in Table 3-11. On the other hand, because the MCIM method treats the background control markers as random effects, the influence of the control markers is much smaller than for the CIM method.

Table 3-11. Power of QTL detection and the probability of false QTL detected under different numbers of background control markers in Model-II (threshold value with LOD = 2.5).
Columns: QTL; CIM with Mn = 5, Mn = 10, Mn = 5; MCIM with Mn = 5, Mn = 10, Mn = 5.
Rows: Q1-1M, Q1-2M, Q1-3M, Q2-1L, Q2-2L, Q3-1S, Q3-2S, FQTL.
Mn = number of control markers.

4. Positions and Effects of Detected QTLs

The position estimates and the 95% experimental confidence intervals (ECI) for the detected QTLs are summarized in Table 3-12 for Model-I and Model-II with the threshold set to LOD = 2.5. For the two QTLs with large effects, the estimation of position is quite accurate, with a small ECI, for all three mapping methods. The average range of the ECI is 14cM, 8.3cM, and 9.5cM for the IM, CIM, and MCIM methods. Unlike for the CIM and MCIM methods, the average range of the ECI increases considerably (11cM to 17cM) from Model-I to Model-II for the IM method. When the QTL has a median effect, the estimation of the QTL position becomes less accurate and the ECI becomes larger. For example, the average range of the ECI is almost doubled for the median-effect QTLs in Model-I with the CIM and MCIM methods (15cM for CIM and 0cM for MCIM). For the two small-effect QTLs, it is difficult to obtain a good estimate of the QTL position and a reasonable ECI, because a QTL of this kind is detected only a few times in 500 replications owing to the extremely low power of QTL detection.

For the single-QTL Model-I, the estimated effects of the detected QTLs are unbiased for the two large QTLs (Q1-1L and Q4-1L), as shown in Table 3-13. However, the QTL effects tend to be overestimated for the QTLs with median and small effects. The reason is that the detection power for QTLs of this kind is much less than 100%. That is, in each replication we only pick a large LR peak (greater than the predefined threshold value) as the identified QTL. It is obvious that the large

LR peak tends to have a larger estimated QTL effect than the small LR peak. Therefore, in a real QTL mapping experiment, a QTL identified with a medium or small effect is likely to have a slightly overestimated effect. The overestimation of the QTL effect can be larger for two linked QTLs, such as Q2-1L and Q2-2L in Model-II. This may imply that QTL linkage affects the estimation of QTL effects. Comparing the three QTL mapping methods, the CIM method performs well in estimating the QTL effects and the ECI for QTLs with medium effects, partially owing to its high detection power for this kind of QTL.

Table 3-1. Simulation results for the positions of the detected QTLs under Model-I and Model-II when the threshold value is set to LOD =.5.

Columns: Genome; QTL; Pos; IM (Est, ECI); CIM (Est, ECI); MCIM (Est, ECI)

Model-I
Q1-1L ( 8 17) 1.4 (8 17) 1.4 (8 16)
Q2-1M (18 45) 7.0 (0 36) 9.6 (18 4)
Q3-1S (69 89) 81.6 (77 89) 79.1 (6 90)
Q4-1L (76 89) 8.5 (78 87) 8. (76 88)
Q5-1M (85 10) 97.5 (87 10) 96.7 (84 10)
Q7-1M ( 0 1) 9. (5 19) 8.1 (0 18)
Q9-1S (41 45) 53.1 (39 66) 46.6 (36 56)

Model-II
Q1-1M (4 6) 16.8 (8 3) 16. (4 6)
Q1-2M (8 37) 31.1 ( 6 41) 30.5 (8 4)
Q1-3M (75 75) 75.4 ( 66 83) 75.1 (64 84)
Q2-1L (4 0) 8.8 ( 6 1) 9.1 (4 1)
Q2-2L (68 86) 8.5 ( 77 86) 81.9 (76 86)
Q3-1S (44 48)
Q3-2S (85 87) ( ) (90 106)

Pos = position of QTLs; Est = estimated QTL position; ECI = 95% experimental confidence interval for QTL position.
Note: a blank table cell is caused by zero detection power.
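The overestimation of medium and small QTL effects described before Table 3-1 follows from keeping only those replicates in which the QTL passes the threshold. The sketch below illustrates this selection effect under deliberately simplified assumptions that are not part of this study: each replicate's effect estimate is drawn from a normal distribution around the true effect, and the QTL counts as detected when the estimate divided by its standard error exceeds a fixed cutoff (a stand-in for the LR peak exceeding the threshold). The ECI is taken here as the 2.5th to 97.5th percentile of the estimates from detected replicates, which is one plausible reading of "95% experimental confidence interval". All names and numbers are hypothetical.

```python
import random
import statistics


def percentile(sorted_values, q):
    """Percentile (q in [0, 1]) of an already sorted list, by linear interpolation."""
    if not sorted_values:
        raise ValueError("no values")
    idx = q * (len(sorted_values) - 1)
    lo = int(idx)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = idx - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def simulate_detected_effects(true_effect, std_error, z_threshold,
                              n_reps=500, seed=1):
    """Return the effect estimates of the replicates in which the QTL is 'detected'.

    Assumption: estimates are normal around the true effect, and detection
    means estimate / std_error > z_threshold.
    """
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_reps)]
    return [e for e in estimates if e / std_error > z_threshold]


def summarize(detected):
    """Mean and percentile-based 95% interval of the detected estimates."""
    ordered = sorted(detected)
    eci = (percentile(ordered, 0.025), percentile(ordered, 0.975))
    return statistics.mean(ordered), eci


# Hypothetical small and large true effects with the same standard error.
results = {}
for label, true_effect in (("small", 0.5), ("large", 1.5)):
    detected = simulate_detected_effects(true_effect, std_error=0.3, z_threshold=3.0)
    mean_est, eci = summarize(detected)
    results[label] = (mean_est, eci, len(detected))
    print(label, "true:", true_effect, "detected:", len(detected),
          "mean estimate:", round(mean_est, 3), "ECI:", eci)
```

Under these assumptions the small-effect QTL is detected in only a minority of replicates, and every retained estimate lies above the cutoff, so the mean of the detected estimates is well above the true effect. The large-effect QTL is detected in almost every replicate, so its mean stays close to the true value.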

Table. Summary of the effects of the detected QTLs under Model-I and Model-II when the threshold value is set to LOD =.5.

Columns: Genome; QTL; Eff; IM (Est, ECI); CIM (Est, ECI); MCIM (Est, ECI)

Model-I
Q1-1L ( ) 1.53 ( ) 1.59 ( 1..1)
Q2-1M ( ) 0.77 ( ) 0.86 ( )
Q3-1S ( ) 0.59 ( ) 0.67 ( )
Q4-1L ( ) 1.49 ( ) 1.54 ( 1.1.0)
Q5-1M ( ) ( ) ( )
Q7-1M ( ) 0.76 ( ) 0.83 ( )
Q9-1S ( ) ( ) ( )

Model-II
Q1-1M ( ) 1.08 ( ) 1.09 ( )
Q1-2M ( ) 0.93 ( ) 1.07 ( )
Q1-3M ( ) ( ) ( )
Q2-1L ( 1.4.1) 1.4 ( ) 1.53 ( 1.1.0)
Q2-2L ( 1.4.1) 1.40 ( ) 1.53 ( 1.1.0)
Q3-1S ( )
Q3-2S ( ) 0.55 ( ) 0.71 ( )

Eff = effect of QTLs; Est = estimated QTL effect; ECI = 95% experimental confidence interval for QTL effect.

5. The LR Profile

The average mapping results of the three QTL mapping methods (IM, CIM, and MCIM) for the two QTL models are shown in Figure 3-1. For Model-I, all three mapping methods performed quite well: the estimates of QTL positions and effects were unbiased, and the LR values depended on the QTL effects. For Model-II, the two QTLs with small effects (Q3-1S and Q3-2S) are undetectable by all three methods. All three methods have very high power to detect the QTLs with large effects (Q2-1L and Q2-2L). However, compared with the CIM and MCIM methods, the IM method shows more noise between these two large-effect QTLs, and this kind of noise could be harmful if such LR peaks were taken as QTLs. For the three QTLs with medium effects on chromosome 1, the highest LR value is obtained by the CIM method for Q1-1M and Q1-3M, but by the MCIM method for Q1-2M. The IM method has a very low LR value for Q1-3M, with the possible reason of no


[Figure 3-1 panels. QTL labels for Model-I: Q1-1L, Q2-1M, Q3-1S, Q4-1L, Q5-1M, Q7-1M, Q9-1S. QTL labels for Model-II: Q1-1M, Q1-2M, Q1-3M, Q2-1L, Q2-2L, Q3-1S, Q3-2S.]

Figure 3-1. Average QTL mapping LR profiles and additive effect profiles over 500 simulations for the two QTL models. The long vertical bars are chromosomes and the short vertical bars are QTL positions and effects. The small dots distributed along the horizontal bars are genetic markers. Only the chromosomes with QTLs (1, 2, 3) are shown for Model-II.
