User Manual ixora: Exact haplotype inferencing and trait association

June 27, 2013

Contents

1 ixora: Exact haplotype inferencing and trait association
  1.1 Introduction
  1.2 Requirements and availability
  1.3 Input format without phenotypic information
    1.3.1 Ped file format
    1.3.2 ixora file format
  1.4 Input format with phenotypic information
  1.5 Input with completely missing parents
  1.6 Output format
    1.6.1 Phasing results in .txt files
    1.6.2 Error measures in .log file
    1.6.3 Statistics in .stats file
2 ixora graphical user interface
  2.1 Executing haplotype inference and association analysis
    2.1.1 Specifying input and output files
    2.1.2 Options and parameters
  2.2 Visualization
    2.2.1 Haplotype frequency
    2.2.2 Expected haplotype frequency distribution
    2.2.3 Haplotype frequencies by phenotype
    2.2.4 P-values for haplotype-phenotype association
  2.3 Troubleshooting
3 Command line executable for exact haplotype inference
  3.1 Parameters

Chapter 1
ixora: Exact haplotype inferencing and trait association

1.1 Introduction

ixora is a framework, proposed in [1], for inferring haplotypes from genotyped population data and for associating observed phenotypes with the inferred haplotypes. The framework is especially applicable in plant breeding, where there exist large populations of individuals from the same parents. Given a set of genotypes for the progeny of at most two heterozygous parents, ixora efficiently and accurately extracts all the equally likely haplotypes of the progeny to produce an agglomerate structure. The structure can be conveniently visualized as mosaics of the ancestor haplotypes, and statistics can be computed on the distribution of crossovers in the progeny. Furthermore, the results of the phasing can be used as input to statistical tests and visualization methods to find genomic regions, and the specific haplotypes in those regions, associated with observed phenotypes. ixora has an option to phase the progeny even in the complete absence of genotype information for the parents.

ixora performs the following two main steps:

1. Exact haplotype inference: this is done via a rigorous mathematical analysis that examines the space of all the best possible haplotype solutions, as described in detail in [1].
2. Statistical analysis of association, per genotyped marker, between the haplotypes and phenotypes: ixora identifies genomic regions of interest when (i) the phenotype is determined by the haplotypes inherited from both parents or (ii) the phenotype is determined by the haplotypes inherited from just one parent.

While performing the latter step, ixora computes the appropriate inputs to statistical tests for haplotype-phenotype association. The current implementation includes Fisher's exact test for this purpose. ixora outputs the resulting p-values for marker-phenotype association as a text file and visualizes the results.

An added value of ixora is the built-in feature to perform randomization tests on the phenotypes, to establish significance thresholds on the p-values. ixora reports these thresholds and visualizes them along with the p-values on the real data. In addition, ixora outputs error measures on the phasing step, summarizing the ambiguity in the data regarding haplotype inference.

1.2 Requirements and availability

The ixora stand-alone framework, executed via a graphical user interface, has the following dependencies (tested on the versions listed): R (version 2.14), Java (version 1.6), and JFreeChart (version 1.0.14). R is required for Fisher's exact test for phenotype association and is not required when performing haplotype inference only. JFreeChart is required for the visualization via the GUI. In addition, the haplotype inference algorithm is provided as a separate command line version in the form of a C++ executable.

ixora dependencies include:

- jfreechart-<version>.jar: This file must be located in the correct Java directory (...\lib\ext\), for example: C:\ProgramFiles\Java\jre7\lib\ext\jfreechart-1.0.14.jar. To obtain this file, download the latest jfreechart release from http://sourceforge.net/projects/jfreechart; the file is located inside the lib folder.
- jcommon-<version>.jar: This file must be located in the correct Java directory (...\lib\ext\), for example: C:\ProgramFiles\Java\jre7\lib\ext\jcommon-1.0.17.jar. To obtain this file, download the latest jfreechart release from http://sourceforge.net/projects/jfreechart; the file is located inside the lib folder.
- Rscript.exe: The user needs to specify the directory where R (including this file) is installed. The Rpath.txt file included in the ixora download will need to be modified to reflect this path, for example: C:\Program\ Files\R\R-2.15.1\bin\Rscript.exe. R can be downloaded from http://www.r-project.org/.

We have tested the system on Windows and Linux. ixora should work on any operating system as long as the dependencies are installed.

Currently, the ixora implementation is designed to work with biallelic markers on a population of diploid individuals derived from at most two parents. For the phenotype association analysis, the parents and progeny are required to have discrete phenotypes (such as resistant/susceptible, green/yellow/red, etc.).

ixora executable files are available for download at http://researcher.ibm.com/project/3430

The zipped folders contain six files:

- ixora.jar: executable Java file for the entire ixora framework, including the graphical user interface
- ixora.exe: executable C++ program for the exact haplotype inference algorithm
- Rpath.txt: one-line text file specifying the path to where the user has R installed
- LICENCE.txt: copy of the licence agreement that the user agreed to when downloading the program
- README.txt: details on the dependencies and limitations of the current implementation
- ExampleData.txt: an example input file for ixora

If ixora is used in published analysis, it should be cited as: Utro, F., Haiminen, N., Livingstone, D., Cornejo, O.E., Royaert, S., Schnell, R.J., Motamayor, J.C., Kuhn, D.N., Parida, L.: ixora: Exact Haplotype Inferencing and Trait Association. Submitted (2012).

For further information visit http://researcher.ibm.com/project/3430 or contact parida@us.ibm.com.

1.3 Input format without phenotypic information

ixora accepts two input file formats: the commonly used .ped file format and a custom file format referred to as the ixora file format. For brevity, only the main features of the .ped file format are presented below, since it is widely used in the scientific community. Each chromosome should be stored in a separate input file, and ixora should be run independently on each chromosome.

1.3.1 Ped file format

If the .ped file format is used, the first two lines must correspond to the parental information. The input file should be formatted as follows:

1. The first five columns must contain a string for each of the following fields: family, person, father, mother and sex.
2. The following columns must contain the genotype information, separated by a space.
3. The first two lines contain the parental information.
4. The following n lines correspond to the n progeny.
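As a rough illustration of the column layout listed above, the following Python sketch splits one .ped line into the five identifier fields and the remaining genotype tokens. It is only a sketch under stated assumptions: the names `PedRecord` and `split_ped_line` and the sample line are hypothetical, and the genotype tokens are kept as raw strings because the exact genotype encoding is not specified here.

```python
from typing import List, NamedTuple


class PedRecord(NamedTuple):
    """One line of a .ped file, following the column order listed in the manual."""
    family: str
    person: str
    father: str
    mother: str
    sex: str
    genotypes: List[str]  # raw tokens, deliberately left uninterpreted


def split_ped_line(line: str) -> PedRecord:
    """Split a whitespace-separated .ped line into identifiers and genotype tokens."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("expected five identifier columns plus at least one genotype column")
    return PedRecord(*fields[:5], genotypes=fields[5:])


# Hypothetical sample line, for illustration only.
record = split_ped_line("F1 Parent1 0 0 1 CC TT AG GG")
print(record.person, record.genotypes)
```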

1.3.2 ixora file format

For n progeny at m loci, the input file should be formatted as follows, as shown in Table 1.1:

1. The first line contains the m marker names separated by spaces.
2. The second and third lines contain the parental sequences. Each line contains m + 1 fields: the parent identifier followed by the m genotypes, separated by spaces. A missing value is represented by dashes (--). Even if all the parents' values are missing, they still need to be included in the input as --. A separate option for handling the case with completely missing parental genotypes is described in Section 1.5.
3. The following n lines correspond to the n progeny. Each line contains m + 1 fields: the progeny identifier followed by the m genotypes, separated by spaces. A missing value is represented by dashes (--).
4. There should be no line break at the end of the last line and no space between the two alleles of a genotype (i.e. it should be AC, not A C or A/C).

    marker1 marker2 marker3 marker4
    Parent1 CC TT AG GG
    Parent2 AC TG AG TG
    Prog1 CC -- GG TG
    Prog2 AC TG AG GG
    Prog3 AC TG AG GG
    Prog4 CC TT GG --
    Prog5 CC TT AG TG

Table 1.1: An example input file for five progeny sequences.

1.4 Input format with phenotypic information

When the user wishes to specify phenotypic information for the progeny, this should be included as the first column after the individual's name, as demonstrated in Table 1.2. The .ped file format is not currently supported in conjunction with phenotype information. The phenotype values must be discrete and start from 0, i.e. the phenotypes can be 0, 1, 2, ... For missing phenotype data the value can be set to -1. The phenotype is then followed by the genotype information as described in the previous section.
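The layout described in Sections 1.3.2 and 1.4 can be read with a few lines of Python. The sketch below is not part of ixora; the function name `parse_ixora` and the embedded sample are illustrative, and it assumes that a missing genotype is written as the single token `--` with no space between the dashes.

```python
def parse_ixora(text, has_phenotype=False):
    """Parse ixora-format text into (markers, parents, progeny).

    Each parent or progeny entry is a tuple (name, phenotype, genotypes);
    phenotype is None when the file has no phenotype column.
    """
    lines = text.split("\n")
    markers = lines[0].split()
    records = []
    for number, line in enumerate(lines[1:], start=2):
        fields = line.split()
        name = fields[0]
        if has_phenotype:
            phenotype = int(fields[1])  # -1 marks a missing phenotype
            genotypes = fields[2:]
        else:
            phenotype = None
            genotypes = fields[1:]
        if len(genotypes) != len(markers):
            raise ValueError(f"line {number}: expected {len(markers)} genotypes, found {len(genotypes)}")
        records.append((name, phenotype, genotypes))
    # The second and third lines of the file are the two parents.
    return markers, records[:2], records[2:]


# Sample data in the phenotype layout; no line break after the last line.
SAMPLE = "\n".join([
    "marker1 marker2 marker3 marker4",
    "Parent1 -1 CC TT AG GG",
    "Parent2 -1 AC TG AG TG",
    "Prog1 1 CC -- GG TG",
    "Prog2 0 AC TG AG GG",
    "Prog3 -1 AC TG AG GG",
    "Prog4 1 CC TT GG --",
    "Prog5 0 CC TT AG TG",
])

markers, parents, progeny = parse_ixora(SAMPLE, has_phenotype=True)
print(len(markers), "markers,", len(progeny), "progeny")
```

If the missing-value token in a real file differs from `--`, the field count check will raise an error rather than silently misalign the columns.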

    marker1 marker2 marker3 marker4
    Parent1 -1 CC TT AG GG
    Parent2 -1 AC TG AG TG
    Prog1 1 CC -- GG TG
    Prog2 0 AC TG AG GG
    Prog3 -1 AC TG AG GG
    Prog4 1 CC TT GG --
    Prog5 0 CC TT AG TG

Table 1.2: An example input file for five progeny sequences including phenotypic information.

1.5 Input with completely missing parents

When the parental genotype values are missing for all markers, they should be specified as -- in the input, and the Missing parents option should be selected in the user interface, or the -missparents parameter given on the command line. ixora will then run in a special mode to determine the parental genotype values.

1.6 Output format

ixora outputs, by default, the phasing results in three files per parent, plus an additional log file describing the preciseness of the phasing solution.

1.6.1 Phasing results in .txt files

The ixora executables generate two text files for each parent. They are named <input> <parent name>.txt and <input> <parent name> phased.txt, where <input> is the name of the input file, and <parent name> is the name given to the parent in the input file, for example Parent1 and Parent2 in Table 1.2. In both files, the first two lines correspond to the parent's phased haplotypes, denoted <parent name> H1 and <parent name> H2. The following n lines correspond to the progeny.

In the file <input> <parent name>.txt, each of the n progeny lines has the following format (an example is shown in Table 1.3):

- The first column contains the name of the progeny as indicated in the input file.
- The second column is a string of m characters, separated by white space (tab). Each character can take one of the following values:
    1: denoting that the allele comes from the first parental haplotype H1
    2: denoting that the allele comes from the second parental haplotype H2
    , q, Q: representing ambiguous values. They indicate multiple equally likely solutions, possibly with multiple crossovers

    b: denoting markers that are potential sources of mistakes, such as markers with a high number of missing data, or imputation of the parents' values from the progeny. The user may choose to replace it with the numerical values in the columns to the left and right of such markers.
    E: denoting that the individual has too many missing values to obtain a phasing result for it

The file <input> <parent name> phased.txt follows the same format but provides one possible haplotype assignment for the , q, Q and b characters, giving a simplified, less precise output.

    Parent1 H1  A T T C T T T T T A
    Parent1 H2  A T T C C T T C C A
    Prog1       1 1 1 1 1 1 1 1 2 2
    Prog2       1 1 1 1 1 1 1 1 1 1
    Prog3       2 2 2 2 2 2 2 2 2 2
    Prog4       1 1 1 1 1 1 1 1 1 1
    Prog5       2 2 2 2 2 1 1 1 1 1

Table 1.3: An example phasing output file <input> <Parent1>.txt, showing the phasing results for Parent1 and for five progeny.

For added convenience, the file <input> <parent name> phased sequence.txt provides the haplotype sequences corresponding to the simplified phasing results in the file <input> <parent name> phased.txt.

1.6.2 Error measures in .log file

An additional output of ixora is the <input>.log file. This file contains the error measures relating to the preciseness of the phasing solution, as discussed in [1]. Delta, D, and E denote, respectively, the distance from the lower bound on the number of crossovers, the ambiguity in the solution, and the errors in the data. If these values are large, it may not be possible to obtain a reliable phasing of the input data.

1.6.3 Statistics in .stats file

The phasing step in ixora outputs an optional <input>.stats file containing a detailed analysis of the haplotype frequencies and their association with phenotype. This file is used as input to the various visualizations and statistical tests. The <input>.stats file contains the following elements, in the order described below:

1. Expected haplotype count (c) and variance (delta), as discussed in [1], per marker for each haplotype pair

2. Haplotype count (frequency) per marker for each parent
3. Expected haplotype count (c) and variance (delta) per marker for each haplotype pair, for individuals with phenotype 0
4. Haplotype count (frequency) per marker for each parent, for individuals with phenotype 0
5. Data 3. and 4. for each remaining phenotype
6. p-value from Fisher's exact test per marker for pair of haplotypes
7. p-value from Fisher's exact test on randomized data for pair of haplotypes
8. p-value from Fisher's exact test per marker for the 1st parent
9. p-value from Fisher's exact test on randomized data per marker for the 1st parent
10. p-value from Fisher's exact test per marker for the 2nd parent
11. p-value from Fisher's exact test on randomized data per marker for the 2nd parent
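To give a feel for the p-values listed in items 6 to 11, the following Python sketch computes a two-sided Fisher's exact test for a single 2x2 table of haplotype by phenotype counts. This is a simplified illustration only: ixora performs the test through R, the layout of the tables it actually builds is not described in this manual, and the function name and counts below are hypothetical.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Rows could be, for example, haplotype H1 / H2 at one marker and
    columns phenotype 0 / 1 (an assumed layout, for illustration).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts: H1 seen in 3 individuals with phenotype 0 and 1 with
# phenotype 1; H2 seen in 1 with phenotype 0 and 3 with phenotype 1.
print(round(fisher_exact_2x2(3, 1, 1, 3), 4))
```

A small p-value at a marker suggests that the inherited haplotype and the phenotype are not independent there; the randomized-data p-values in the .stats file serve as the reference for how small is small enough.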

Chapter 2
ixora graphical user interface

This chapter describes using the ixora program with the graphical user interface ixora.jar for haplotype inference and statistical testing of haplotype-phenotype associations. ixora can be started by double-clicking on the ixora.jar executable.

2.1 Executing haplotype inference and association analysis

This section describes executing the haplotype inference and association analysis. The haplotype analysis is started by pressing the RUN button after specifying the input file, output folder and desired options as described below.

2.1.1 Specifying input and output files

To begin an analysis, the user must specify an input file in the ixora (or .ped) format by using the Browse button in the Select Input File section of the main dialog window. Similarly, Select Output Folder must be used to specify the location of the resulting output files.

2.1.2 Options and parameters

The following options are provided in the ixora main dialog window:

- If the input genotypes are derived from the selfing of a single parent, this can be indicated in the Self check box. When using this option, the user should specify two parents whose genotypes are identical but whose names are different, for example: Parent1AsMother and Parent1AsFather.

- If all the genotype values for the parents are missing, the Missing parents check box should be selected; otherwise it should be left unselected. When selected, ixora will run in a special mode to infer the parental genotypes. Note that in this case the labeling of the parents (Parent1 / Parent2) in the output is arbitrary, since both parents are completely missing in the input.
- If the input data file contains phenotype information, the Phenotype included check box should be selected; otherwise it should be left unselected.
- Number of randomizations denotes the number of iterations for estimating a significance threshold via permutation tests. The default value is 20 and the minimum value is 1. Note that running many (i.e. several tens or hundreds of) randomizations can be time consuming.

2.2 Visualization

In this section we describe the visualizations that ixora provides. The input to visualization is the data contained in the phasing output statistics file. To run visualizations, the user must select File > Open from the main dialog menu to open the relevant .stats file. After this, the user can choose among the visualizations given as options in the Visualize drop-down menu in the main dialog window.

The figures produced by ixora can be saved as .png files by right-clicking on them with the mouse and selecting Save as. It is possible to zoom in on the window contents by left-clicking with the mouse on the upper left corner of the desired region and holding the mouse button down until the desired range is shaded. The range can also be manipulated by right-clicking with the mouse and selecting the desired actions. The visualization options are described in the following subsections.

2.2.1 Haplotype frequency

Haplotype frequency will draw histograms of haplotype counts at each marker for each parent.

2.2.2 Expected haplotype frequency distribution

Frequency distribution will draw the expected counts and frequencies of haplotype pairs at each marker. The shaded areas correspond to ambiguity in the phasing result.

2.2.3 Haplotype frequencies by phenotype

Phenotypic charts will draw Haplotype frequency and Frequency distribution plots, after dividing the individuals into subsets based on their phenotype.
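The per-marker counts behind such charts can be approximated directly from a phasing result file. The Python sketch below tallies, for one parent, how many progeny carry haplotype 1 or 2 at each marker, optionally split by phenotype. It is an illustration rather than ixora's own computation: the function name is hypothetical, the rows reuse the progeny lines of Table 1.3, the phenotype labels are made up for the example, and any code other than 1 or 2 (ambiguous or error characters) is simply skipped.

```python
from collections import Counter


def haplotype_counts(rows, phenotypes=None):
    """Count haplotype codes '1' and '2' per marker.

    rows: dict mapping progeny name to its list of per-marker codes.
    phenotypes: optional dict mapping progeny name to phenotype; progeny
    with phenotype -1 (missing) or no entry are left out.
    Returns a dict mapping group (phenotype, or "all") to a list of Counters.
    """
    counts = {}
    for name, codes in rows.items():
        group = "all" if phenotypes is None else phenotypes.get(name, -1)
        if group == -1:
            continue
        per_marker = counts.setdefault(group, [Counter() for _ in codes])
        for counter, code in zip(per_marker, codes):
            if code in ("1", "2"):  # skip ambiguous and error codes
                counter[code] += 1
    return counts


# Progeny rows as in Table 1.3; phenotype labels are hypothetical.
ROWS = {
    "Prog1": "1 1 1 1 1 1 1 1 2 2".split(),
    "Prog2": "1 1 1 1 1 1 1 1 1 1".split(),
    "Prog3": "2 2 2 2 2 2 2 2 2 2".split(),
    "Prog4": "1 1 1 1 1 1 1 1 1 1".split(),
    "Prog5": "2 2 2 2 2 1 1 1 1 1".split(),
}
PHENOTYPES = {"Prog1": 1, "Prog2": 0, "Prog3": -1, "Prog4": 1, "Prog5": 0}

by_phenotype = haplotype_counts(ROWS, PHENOTYPES)
for phenotype in sorted(by_phenotype):
    print(phenotype, [dict(c) for c in by_phenotype[phenotype]][:2])
```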

2.2.4 P-values for haplotype-phenotype association

P-value will draw two plots: one for the combined effect of the parents, and one where each parent's effect on the phenotype is shown as a separate subplot. The p-values result from Fisher's exact test on haplotype-phenotype association. The most significant values found in the randomizations are also shown, as is the background significance level.

2.3 Troubleshooting

Here is a list of some possible problems and solutions when running the ixora analysis via the user interface.

Problem: There is an error in the execution of ixora.
Solution: The most likely reason is that the input file is formatted incorrectly, or that the appropriate option regarding the presence of phenotype data in the input is not set. Please take a look at the examples in the sections regarding the input format.

Problem: The ixora phasing analysis takes a long time to complete.
Solution: Reduce the number of randomization tests for the statistical significance. The phasing itself should run in a short time for, e.g., hundreds of markers and individuals.

Problem: There is no reaction when double-clicking on the ixora.jar file, or it takes a long time to react.
Solution: Kill any lingering javaw processes associated with ixora (in Windows via the Task Manager).
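For the first problem above, a quick pre-check of the input file can help decide whether the formatting or the phenotype option is at fault. The following Python sketch is a hypothetical helper, not part of ixora: it compares the number of fields on each line with the number of markers to guess whether a phenotype column is present, and flags a trailing line break. It assumes the ixora file format and that a missing genotype is a single token such as `--`.

```python
def diagnose_input(text):
    """Return (layout, problems) for ixora-format text.

    layout is "no phenotype column", "phenotype column present",
    or "undetermined"; problems is a list of human-readable messages.
    """
    problems = []
    if text.endswith("\n"):
        problems.append("file ends with a line break")
    lines = text.rstrip("\n").split("\n")
    if len(lines) < 3:
        problems.append("expected a marker line followed by two parent lines")
    m = len(lines[0].split())
    extras = set()
    for number, line in enumerate(lines[1:], start=2):
        # Fields beyond the m genotypes: 1 = name only, 2 = name + phenotype.
        extra = len(line.split()) - m
        if extra in (1, 2):
            extras.add(extra)
        else:
            problems.append(
                f"line {number}: expected {m + 1} or {m + 2} fields, found {m + extra}"
            )
    if extras == {1}:
        layout = "no phenotype column"
    elif extras == {2}:
        layout = "phenotype column present"
    else:
        layout = "undetermined"
        if len(extras) == 2:
            problems.append("lines disagree on whether a phenotype column is present")
    return layout, problems


EXAMPLE = "\n".join([
    "marker1 marker2 marker3 marker4",
    "Parent1 -1 CC TT AG GG",
    "Parent2 -1 AC TG AG TG",
    "Prog1 1 CC -- GG TG",
])
print(diagnose_input(EXAMPLE))
```

If the reported layout is "phenotype column present", the Phenotype included check box should be selected before pressing RUN.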

Chapter 3
Command line executable for exact haplotype inference

This chapter describes the use of the C++ executable ixora.exe for exact haplotype inference. The inputs and outputs are as described in Chapter 1.

3.1 Parameters

The binary file ixora can be executed with command line parameters:

    ixora <-i (or -p) input> [-self] [-missparents] [-phen] [-stats] [-r T] [-debug] [-o path]

where:

- -i: assumes the input file is in ixora file format;
- -p: assumes the input file is in ped file format;
- -self: indicates that the two parents are the same;
- -missparents: indicates that the parents' genotypes are unknown;
- -phen: indicates that phenotype is included in the input;
- -stats: produces a stats file as described in the section Output format;
- -r T: indicates the number of randomizations for the p-value threshold computation, by default T=20;
- -debug: prints to the standard output all the information computed by ixora, step by step;
- -o path: defines the path where the output is stored.

The parameter enclosed in <> brackets must be provided. Any remaining parameter that is not set is assigned its default value. For example, to execute ixora using the input file ExampleData.txt (without phenotype information) or ExampleDataPheno.txt (with phenotype information), possible command lines include:

    ixora -i ExampleData.txt -stats
    ixora -i ExampleDataPheno.txt -phen -stats -r 50
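Because each chromosome is stored in a separate input file, it can be convenient to generate the command lines from a script. The Python sketch below assembles argument lists from the parameters documented above; the helper name `build_ixora_command`, the executable name on the search path, and the per-chromosome file names are assumptions made for illustration.

```python
import shlex


def build_ixora_command(input_file, ped=False, selfing=False, missing_parents=False,
                        phenotype=False, stats=False, randomizations=None,
                        output_path=None, executable="ixora"):
    """Assemble an ixora command line as a list of arguments."""
    if ped and phenotype:
        # Section 1.4: the .ped format is not supported together with phenotypes.
        raise ValueError("-phen cannot be combined with .ped input")
    command = [executable, "-p" if ped else "-i", input_file]
    if selfing:
        command.append("-self")
    if missing_parents:
        command.append("-missparents")
    if phenotype:
        command.append("-phen")
    if stats:
        command.append("-stats")
    if randomizations is not None:
        command += ["-r", str(randomizations)]
    if output_path is not None:
        command += ["-o", output_path]
    return command


# Hypothetical per-chromosome input files, one run per chromosome.
for chromosome_file in ["chr1.txt", "chr2.txt"]:
    command = build_ixora_command(chromosome_file, phenotype=True, stats=True,
                                  randomizations=50)
    print(shlex.join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` to start the run, provided the ixora executable is installed and reachable under the assumed name.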

Bibliography

[1] F. Utro, N. Haiminen, D. Livingstone, O.E. Cornejo, S. Royaert, R.J. Schnell, J.C. Motamayor, D.N. Kuhn, and L. Parida. ixora: Exact Haplotype Inferencing and Trait Association. BMC Genetics 14:48, 2013.