QTL Analysis and Linkage Library (QTL-ALL)

User Guide

User Guide for QTL-ALL version /13/08 Edition of the User Guide

Nandita Mukhopadhyay
Samsiddhi Bhattacharjee
Chia-Ling Kuo
Daniel E. Weeks
Eleanor Feingold

Departments of Human Genetics and Biostatistics
University of Pittsburgh Graduate School of Public Health

The development of this software was supported by NIH R01 HG (Feingold) and NIH/Fogarty D43 TW (Weeks).

Table of Contents

1 Overview
2 Getting Started
    If you already have QTL-ALL version 1.0
    How to download QTL-ALL
    Quick install instructions
    Installing on Linux and Solaris
    Installing on Mac
    Installing on Windows
    Quickstart: Analysis of example data on Linux, Solaris and Mac
    Quickstart: Analysis of example data on Windows-Cygwin
    Running QTL-ALL on your own data
    How to report a bug or make a suggestion
3 Running the Program
    Input
    Locus data file
    Pedigree data file
    Map data file
    Omit file
    Menus for input data handling
    Choice of statistics to calculate
    Statistic selection menu
    Parameter values
    IBD computation
    Which statistic should you use for sibships?
    Which statistic should you use for sib pairs?
    Output
    3.3.1 Results files
    Other important output files
    Additional files
    Menus for output specification
    Additional comments and recommendations
    Using QTL-ALL with dense SNP datasets
    Using QTL-ALL in batch mode
    Customizing your QTL-ALL runs and plots
    Troubleshooting
    Changes from version
4 Statistics
    Score statistics for general pedigrees
    Notation and theory
    SCORE.CT, SCORE.EV, and SCORE.MAX
    SCORE.MERLIN
    SCORE.2DF.CT
    HM.CT, HM.2DF.CT, HM.MERLIN
    SCORE.NAIVE, SCORE.CIBD, HM.NAIVE
    Special statistics for sib pairs
    Notation and theory
    ORIGINAL.HE
    HE.COM1 and HE.COM
    XU
    IBD
    RDP
    GU.EMPIRICAL
    Statistic calculation details
    Missing data handling
    Matrix operations
5 References

1 Overview

What it does: QTL-ALL (Quantitative Trait Locus Analysis and Linkage Library) is a QTL linkage analysis program. It calculates a wide variety of QTL linkage statistics for nuclear families. Score statistics are emphasized, as are statistics appropriate for ascertained (nonpopulation) samples such as concordant and discordant sibling pairs. Currently, only sib-pair and sibship data are supported, but we plan to extend the capability to arbitrary pedigrees soon.

How it works: QTL-ALL uses standard linkage analysis input file formats (see section 3.1). Text-based menus for setting up the analysis have a similar look and feel to Mega2 (Mukhopadhyay et al., 2005). The QTL-ALL set-up program takes the user through the menus, reads the data files, and performs a number of basic data checks. It then writes a shell script to perform the analysis. The shell script calls an external program to compute identity-by-descent (IBD) estimates (currently a user choice of Merlin or SimWalk2) and then uses the IBD estimates and the phenotype data to compute the linkage statistics that the user has chosen. User-controlled displays of results include a variety of tables and plots of test statistics and p-values.

Statistics available: QTL-ALL computes a number of variants of the score statistic (see chapter 4 for details). These include versions of the statistic that are robust to selected sampling, as well as higher moment score statistics (Chen et al. 2005), 2 degree of freedom (dominance) statistics, and the MERLIN-REGRESS statistic (Sham et al. 2002). Other standard statistics include the Haseman-Elston statistic (Haseman and Elston, 1972), and statistics for discordant and concordant sibling pairs and combinations thereof.

Computing platforms: QTL-ALL is portable across many platforms, including Solaris, Linux, Dec-Alpha, Macintosh, and Cygwin running on Windows. It can be run locally on a desktop or remotely from a server.
It can be run interactively as well as in a non-interactive mode, making it suitable for cluster (or batch) computing.

Major Limitations: If your dataset contains pedigrees other than full-sib nuclear families, these will not be included in the analysis. QTL-ALL will produce a warning listing the pedigrees that cannot be analyzed, and will exclude them. For nuclear families, parental phenotypes (if specified) will not be included in the analysis, but parental genotypes will be used for identity-by-descent (IBD) computation. QTL-ALL also does not handle X-linked traits. If the dataset includes markers on the X chromosome, these markers will not be used.

How to use this user guide: Chapter 2 describes how to download and install QTL-ALL, as well as other necessary software (e.g. R, Perl). If you are impatient to jump right in and run the program, try following the Quickstart instructions in chapter 2. For a more detailed explanation of how to use QTL-ALL, including input and output options and selection of statistics, read chapter 3. Chapter 4 contains reference material on all of the statistics that QTL-ALL computes: theory, formulas, and performance results.

References: Most of the score statistics implemented in QTL-ALL are described in Bhattacharjee et al. (Bhattacharjee S, Kuo C-L, Mukhopadhyay N, Brock GN, Weeks DE, Feingold E. Robust score statistics for QTL linkage analysis. American Journal of Human Genetics, 82: , 2008), which also contains detailed simulation results. This is the paper you should cite when you use QTL-ALL in a publication. Additional theory and simulation studies can be found in the references cited in chapter 4 of this user guide.

2 Getting Started

2.1 If you already have QTL-ALL version 1.0

If you already have QTL-ALL version 1.0, you can use the instructions in the following sections to install the new version. There are just two things you need to be aware of.

1) The R plotting package nplplot is no longer included in the QTL-ALL distribution. It must be downloaded separately from CRAN's add-on packages distribution site.
2) When you run the new installation script, it installs binary executables labeled with the new version number, so it does not overwrite the older QTL-ALL binary executables. However, it does overwrite the Perl scripts. The installation script will ask you whether the aliases qtlall, qtlall_sibships, and qtlall_sibpairs should be replaced to point to their new counterparts. If you answer yes, the command qtlall will run the new version of the code instead of the old one.

2.2 How to download QTL-ALL

You can download either pre-compiled versions of QTL-ALL or source code from our web site. Pre-compiled packages are available as zipped archives (archived using gnu-tar and compressed using gnu-zip) for Linux, Solaris, and Windows-Cygwin. For the Macintosh, QTL-ALL is only available as a source distribution, to ensure that you end up with the most efficient binary for the processor that you have.

Pre-compiled QTL-ALL packages are named as follows.

qtlall_v1.0.1_linux.tar.gz    : compiled on an Intel machine running Linux
qtlall_v1.0.1_solaris.tar.gz  : compiled on a machine running Solaris 2.8
qtlall_v1.0.1_cygwin.tar.gz   : compiled for Windows using Cygwin

You can get these by clicking the bin link on the download page. The source distribution of QTL-ALL is named

qtlall_v1.0.1_src.tar.gz

Expiration date: QTL-ALL has a built-in expiration date that is usually a year from its release date. This date is posted with the release and is displayed with every run of the program. We do this in order to remind (OK, force) you to get regular updates of the program. We expect to make rapid fixes and enhancements to QTL-ALL, and we would like you to work with the most up-to-date version possible.

2.3 Quick install instructions

This short installation procedure is meant for experienced Unix users, i.e. those familiar with using and installing software on Linux, Solaris, Mac's Darwin OS, or Cygwin on Windows. If you are less familiar with Unix, please follow the more detailed instructions given in section 2.4, 2.5 or 2.6, depending on your platform.

1. Before you install QTL-ALL from a binary distribution, make sure that the following software is installed on your computer as well.
   a. GSL (the Gnu Scientific Library)
   b. Perl
   c. R and nplplot
   d. bash and csh/tcsh
   e. Merlin or Simwalk2 (QTL-ALL uses these to compute identity-by-descent estimates)
2. If you are installing from source, which is the only option for Mac, also make sure that Gcc (Gnu C-compiler) and Gnu-make are installed.
3. If you are installing on Windows from either the binary or source distributions, make sure that Cygwin is installed.
4. To install QTL-ALL from pre-compiled binaries or from the source distribution, untar and uncompress the gzipped QTL-ALL archive, then change directory into the QTL-ALL install folder, and type
   ./install.sh
5. To run QTL-ALL on the example data, assuming that qtlall is located in a directory on your executable search path, change your directory into the example folder, and then type
   qtlall
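The prerequisite list in steps 1 to 5 above can be checked from the shell before running the installer. The following is a minimal sketch, not part of the QTL-ALL distribution: the function names check_tools and in_path are hypothetical helpers, and the command names passed to them (perl, R, bash, tcsh) are assumptions about how the tools are named on your system. Merlin or Simwalk2 can be added to the list once you know the name of their executables.

```shell
#!/bin/sh
# Hypothetical helper: report which of the named commands cannot be
# found on the current search path. Prints one "missing: NAME" line per
# absent command and returns 1 if any command is absent.
check_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}

# Hypothetical helper: succeed if the given directory is an entry in PATH.
in_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Command names below are assumptions; adjust them to your installation.
check_tools perl R bash tcsh || echo "Install the missing prerequisites before running ./install.sh"

# /usr/local/bin is only an example install directory.
in_path /usr/local/bin || echo "/usr/local/bin is not on your PATH"
```

If both checks pass silently, the tools named above were found and the example install directory is on the search path. The check does not cover GSL or the nplplot R package, which are libraries rather than commands.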

2.4 Installing on Linux and Solaris

Before you install QTL-ALL: The following software should be installed before you install and use QTL-ALL.

1. gcc (the Gnu-C compiler)
This is almost always installed by default on Linux; however, you may need to install it separately if you are using Solaris.

2. GSL (the Gnu Scientific Library)
Pre-compiled GSL binaries are now part of most GNU/Unix installations. If this is not the case for your Unix installation, you will have to obtain and install GSL yourself. GSL can be downloaded as a tar-gzip archive from one of its distribution mirrors. It is also available in RPM (RedHat Package Manager) format. To install from the tar-gzip archive, uncompress and untar the source, and then run
   i) configure
   ii) make
   iii) make install   <- requires root access.

3. Perl
Perl is needed to reformat some output files. It is part of most common Unix/Linux distributions. If you do not have it, you will need to download and install it.

4. R and nplplot
The R statistical package and the nplplot R library are required to create p-value plots. Both R and nplplot (which can be found in the list of add-on packages) are available from the R distribution site. Instructions for downloading and installing R and add-on packages are available at that site.

5. bash and csh/tcsh
The bash shell interpreter is required to install QTL-ALL. The csh (or tcsh) interpreter is needed to analyze data using QTL-ALL. These are part of any standard Linux or Solaris distribution, so typically you don't need to do anything extra to install them.

6. Gnu-make
This is necessary only if you are planning to compile QTL-ALL from the source distribution.

7. Merlin and/or Simwalk2
QTL-ALL uses Merlin or Simwalk2 to compute identity-by-descent estimates. You will need one or the other of these (probably Merlin) to use QTL-ALL. Both can be downloaded from their respective distribution sites.

To install QTL-ALL from pre-compiled binaries: After downloading the QTL-ALL distribution package, you can install it using the following steps.

1) Uncompress the package using the "gunzip" Unix command and then untar it using the tar command. For example, for the Solaris package you would say:

> gunzip qtlall_v1.0.1_solaris.tar.gz
> tar -xvf qtlall_v1.0.1_solaris.tar

If you are using the GNU version of the tar command, you can combine the above two steps into a single command:

> tar -xzvf qtlall_v1.0.1_solaris.tar.gz

The "z" tells tar that it is a compressed archive. At this point, you should see a new directory named qtlall_v1.0.1_[platform]. For example, if you downloaded qtlall_v1.0.1_linux.tar.gz, you should see a directory named qtlall_v1.0.1_linux. This directory contains the files and sub-directories listed below in alphabetical order:

CHANGES.txt                       List of changes for this version
example                           Directory of example input files
example_output_sibpairs           Directory of example output files
example_output_sibships           Directory of example output files
install.sh                        Bash script for installing perl scripts and binary executables
INSTALL.txt                       File containing installation instructions
LICENSE.txt                       File containing license agreement
REQUIREMENTS.txt                  File listing additional software needed
qtlall_v1.0.1_linux.gz            QTL-ALL binary (compressed)
qtlall_log2html.pl.src            Perl script to convert run logs into HTML format
qtlall_output_stats.pl.src        Perl script for producing tables of p-values and statistics
qtlall_pvalue2rplot.pl.src        Perl script to convert QTL-ALL pvalues to nplplot format
qtlall_read_merlin_ibds           Binary executable for reading in Merlin IBD files
qtlall_read_merlin_kmx.pl.src     Perl script for reading in Merlin kinship matrix files
qtlall_read_simwalk2_ibds.pl.src  Perl script for reading SimWalk2 IBD files
qtlall_sibpair_v1.0.1_linux.gz    Sibpair statistic computation library (compressed)
qtlall_sibships_v1.0.1_linux.gz   Sibship statistic computation library (compressed)
QTLALL User Guide.doc             This document

NOTE: The Perl scripts have the extension .src because they are missing the path of the Perl interpreter at the top of the script. During installation, this path is added, and the resulting script is then directly executable.

2) Your system will need the Unix bash shell to execute the installation script. Execute the script install.sh by typing

> ./install.sh

This script checks whether Perl and R are installed, and will fail to install if they are absent. Once these checks have been performed, the install script prompts you to enter a directory to copy the binary executables and scripts into. You must have read and write permissions on this directory. If you want to install in a system directory such as /usr/local/bin, you may need to be logged in as administrator/root, or you may have to use the command sudo ./install.sh.

3) The install script will ask you to enter the full path name of the desired directory, e.g.

> Enter folder name for scripts and binaries to be installed: /usr/local/bin

For these scripts to be accessible, the directory you enter must be in your path.

4) For QTL-ALL to work correctly, the install script needs to create aliases (or soft-links in Unix lingo) for the statistics routines named qtlall_sibpairs and qtlall_sibships. If you have previously existing files with these names, you will be asked if these files should be replaced by the new links.

5) If installation proceeded smoothly without any errors, QTL-ALL should now be installed and ready to use! To make absolutely sure that QTL-ALL components are in the right place, use the Unix which command. On some platforms, you may need to first type rehash to refresh your search path (this is not necessary on Linux, and the rehash command is not a standard Linux command).

> rehash
> which qtlall

If installation succeeded, you will see something like the following in response to the which command.

/usr/local/bin/qtlall

If installation was not completed correctly, you will get either an empty message or something to the effect that qtlall was not found. Failure to install may indicate that you do not have adequate access rights to the directory QTL-ALL is being installed into (on Unix, you must have write permission on the directory that you specified above), or that system requirements were not met (e.g. Perl is not installed).

To install QTL-ALL from source: If you prefer to compile QTL-ALL on your own machine, you can download the tar-gzipped source directory named qtlall_v1.0.1_src.tar.gz. You will need gnu-make to compile QTL-ALL. To compile and install, simply execute the installation script by typing ./install.sh. Just as with the pre-compiled binary distribution, the install script will ask you to enter the full path name of the desired directory, e.g.

> Enter folder name for scripts and binaries to be installed: /usr/local/bin

For these scripts to be accessible, the directory you enter must be in your path. The install script will compile the QTL-ALL data-setup and statistics programs, and then place all executables and Perl scripts inside the folder you specified at the first prompt. Note that compilation of the source code will fail if you do not have GSL (GNU Scientific Library) pre-installed.

2.5 Installing QTL-ALL on a Mac

Because there are compatibility issues between Intel Macs and PowerPC Macs, we do not have a pre-compiled binary for Mac users. You will therefore need to download the source bundle and compile QTL-ALL yourself, following the instructions given below.

Hints for Darwin and/or Unix newbies: The instructions below assume that you are at least somewhat familiar with the Darwin Unix environment (i.e. navigating your Mac and using Unix commands in the terminal window). QTL-ALL itself is run via the terminal window. If you are not at all familiar with Unix, you will probably need some help, but even if your Unix fluency does not extend much beyond cd and ls, you can probably follow the instructions below to install and run QTL-ALL. Here are a few hints that might be helpful along the way.

1) If you do not currently have a terminal program running, you can look in the Applications folder, then the Utilities folder, and open the Terminal application. This will give you a simple terminal in which you can enter the Unix commands described below.

2) To see whether you have a particular piece of software already installed, you can use the command which from within the terminal window. For example, to check whether you have gcc installed, type which gcc. The result should be something like /usr/bin/gcc if you have gcc. If you don't have it, you will just get a fresh prompt (i.e. no response).

3) When you download software (QTL-ALL or any of the preliminaries described below), it will generally end up on your desktop or in your downloads folder. You can then double-click it to uncompress it so that it is ready to install. Items on the desktop can be seen in the ~/Desktop subdirectory using the Darwin interface (terminal window).
4) When you install software you may need to be logged in as root or administrator, and you may need to use the command sudo. For example, in the final step of installation of QTL-ALL (described below), you may need to type sudo ./install.sh instead of just ./install.sh. You will then be prompted for your password (which must be for an account with admin privileges) in order to complete the installation.

Before you install QTL-ALL: The following software should be installed before you install and use QTL-ALL.

1. Gcc (Gnu C-compiler) and Gnu-make
Note that you may not have gcc and Gnu-make installed by default. If this is the case, you will need to install the Xcode developer tools on your Mac. These are provided with your Mac installation disks, or you can download them (this is free, but you have to register). You can check whether you have gcc installed using the which command as described in the hints above. To figure out whether you have Gnu-make, run make with the -v option:

make -v

You should see something like

GNU Make 3.80
Copyright (C) 2002 Free Software Foundation, Inc.
etc. etc.

2. GSL (the Gnu Scientific Library)
GSL is required to compile and run QTL-ALL. It is available free under the GNU public license. Download the newest version. Double-click on the downloaded GSL distribution file to uncompress and untar it. This creates a GSL folder labeled with the version (for example, gsl-1.9 from gsl-1.9.tar.gz). From inside the terminal window, cd into this folder, then type the following commands.
   i) ./configure
      This runs the configuration script inside the GSL folder.
   ii) make
      Make is a Unix command that compiles the GSL source into C libraries, following the compilation steps given in the Makefile inside the GSL folder.
   iii) make install   <- requires root access.
      This step tries to copy the compiled libraries into the location where all other C libraries are kept. For the default location and how it can be changed, please look at the README file inside the GSL folder.

3. Perl
Perl is needed to reformat some output files. It is part of the Xcode developer tools as well. Again, you can use which to see if you already have Perl.

4. R and nplplot

The R statistical package and the nplplot R library are required to create p-value plots. If you do not already have R, download it and install it. This can be done entirely within the Mac interface, with no Unix necessary. To install nplplot (or any other R package), it is easiest to use the Packages&Data menu from within R (or you can use the R command install.packages). You can also download nplplot directly and install it yourself from within a Unix shell using the R CMD INSTALL command.

5. Merlin or Simwalk2
QTL-ALL uses Merlin or Simwalk2 to compute identity-by-descent estimates. You will need one or the other of these (probably Merlin) to use QTL-ALL. Both can be downloaded from their respective distribution sites.

To install QTL-ALL: After downloading the QTL-ALL source distribution package file, qtlall_v1.0.1_src.tar.gz, you can install it using the following steps. Let us assume that this file resides on your Desktop.

1) Uncompress the package by double-clicking on it.

2) At this point, you should see a new sub-directory (sub-folder) named qtlall_v1.0.1_src on your Desktop. You can either double-click this folder inside the Finder, or use the ls command inside the terminal window to check its contents. This directory contains the files and sub-directories listed below in alphabetical order:

CHANGES.txt                       List of changes for this version
example                           Directory of example input files
example_output_sibpairs           Directory of example output files
example_output_sibships           Directory of example output files
install.sh                        Bash script for installing perl scripts and binary executables
INSTALL.txt                       File containing installation instructions
LICENSE.txt                       File containing license agreement
REQUIREMENTS.txt                  File listing additional software needed
qtlall_src.tar.gz                 Source for QTL-ALL data setup (compressed)
qtlall_log2html.pl.src            Perl script to convert run logs into HTML format
qtlall_output_stats.pl.src        Perl script for producing tables of p-values and statistics
qtlall_pvalue2rplot.pl.src        Perl script to convert QTL-ALL pvalues to nplplot format
qtlall_read_merlin_ibds           Binary executable for reading in Merlin IBD files
qtlall_read_merlin_kmx.pl.src     Perl script for reading in Merlin kinship matrix files
qtlall_read_simwalk2_ibds.pl.src  Perl script for reading SimWalk2 IBD files
qtlall_statistics_src.tar.gz      Source for sibpair and sibships statistic computation library (compressed)
QTLALL User Guide.doc             This document

NOTE: The Perl scripts have the extension .src because they are missing the path of the Perl interpreter at the top of the script. During installation, this path is added, and the resulting script is then directly executable.

3) Now, from within the terminal window, cd into the qtlall_v1.0.1_src folder.

4) Execute the script install.sh by typing

./install.sh

This bash script checks whether Perl and R are installed, and will fail if they are absent. Once these checks have been performed, the install script prompts you to tell it what directory (folder) to copy the binary executables and scripts into. If you want to install in a system directory such as /usr/local/bin, you may need to be logged in as administrator/root and/or you may have to use the command sudo ./install.sh.

5) The install script will ask you to enter the full path name of the desired directory, e.g.

Enter folder name for scripts and binaries to be installed: /usr/local/bin

For these scripts to be accessible (i.e. for QTL-ALL to work), the directory you enter must be in your path.

6) For QTL-ALL to work correctly, the install script needs to create aliases (or soft-links in Unix lingo) for the statistics routines named qtlall_sibpairs and qtlall_sibships. If you have previously existing files with these names, you will be asked if these files should be replaced by the new links.

7) If installation proceeded smoothly without any errors, QTL-ALL should now be installed and ready to use! To make absolutely sure that QTL-ALL components are in the right place, use the Unix which command.

2.6 Installing QTL-ALL on Windows

Note for Cygwin and/or Unix newbies: QTL-ALL runs within Cygwin, which is a Unix-like environment for the Windows operating system. If you have never used Cygwin or Unix before, you will probably need some help with the installations described below. Installing Cygwin is not trivial. But if you have Cygwin or can get it installed, you should be able to follow the directions below and do the rest of the installation even if your Unix fluency is relatively modest.

Before you install QTL-ALL: The following software should be installed before you install and use QTL-ALL.

1. Cygwin
Download and install Cygwin from the Cygwin distribution site. IMPORTANT: the Cygwin version should be or higher to be able to run QTL-ALL. Click on the "install or update now!" link and follow the instructions. The Cygwin user's guide is available from the same site. If your Windows machine has multiple user accounts, you should install Cygwin when logged in as "administrator". Also, select the "All users" default option to make Cygwin available to all users. You should do a custom install of Cygwin, and make sure that the following are installed:
   a) Perl
   b) bash and tcsh
   c) An editor such as nano or vim (inside "editors")
If you are compiling QTL-ALL from source, you will also need:
   d) make
   e) gcc (Gnu C-compiler)
   f) Gnu C libraries
   g) GSL (GNU scientific library)
See the detailed instructions below on each of these components.

1a) Perl
Perl is needed to reformat some output files. To install Perl, click on the checkbox next to the item called Perl inside the top-level list of Cygwin components, and change the "default" to "install". In the newest version of Cygwin, this is set to install by default; however, you should verify this.

1b) bash and tcsh
The bash shell interpreter is required to install QTL-ALL. The tcsh interpreter is needed to analyze data using QTL-ALL. These shells are listed under "shells". Click on the checkboxes next to these items, and change the default to install.

1c) Nano or Vim
These are available within "editors". As before, for each item you want to install, click on the checkbox next to it and change default to install.

1d) Make and 1e) Gcc
These are available within Devel (development tools).

1f) Gnu C libraries
QTL-ALL uses basic C libraries such as the math library, as well as the Gnu scientific library (GSL). The base libraries are installed along with gcc.

1g) GSL (Gnu scientific library)
A compiled version of GSL is found inside the Lib component of Cygwin. GSL can also be compiled from source (follow the link to Download GNU, then Free software directory).

If you forget to install any of these components, you can update the existing installation using the same setup program, setup.exe.

The Cygwin environment

Cygwin provides a Unix-like environment inside the Windows operating system. During the installation process it will prompt you for an installation path. Let us assume that you choose to install it on the Desktop. You will then see a folder called Cygwin on the Desktop, which contains folders named "home", "bin", "etc", "tmp", "sbin", "usr", and so on. An icon is created on the Desktop for running the Cygwin program.
When you click on this icon, Cygwin starts an interactive shell (resembling the Windows DOS shell), and puts you inside the /home/<your_user_name> folder.
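Before continuing with the installation, it can help to confirm that the shell you are typing into really is the Cygwin shell described above. The following is a minimal sketch under stated assumptions: is_cygwin_name is a hypothetical helper, and it relies only on the fact that uname -s reports a kernel name beginning with "CYGWIN" inside Cygwin.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given kernel name looks like a
# Cygwin one. Under Cygwin, `uname -s` reports a name beginning with
# "CYGWIN".
is_cygwin_name() {
    case "$1" in
        CYGWIN*) return 0 ;;
        *)       return 1 ;;
    esac
}

if is_cygwin_name "$(uname -s)"; then
    echo "Cygwin shell, home directory: $HOME"
else
    echo "Not a Cygwin shell (uname -s reports $(uname -s))"
fi
```

Inside Cygwin, the reported home directory should be the /home/<your_user_name> folder mentioned above.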

Accessing data from inside of Cygwin

Files within Cygwin folders can be opened from Windows using the Windows file explorer. All Cygwin files and folders are kept within the Cygwin installation folder (in our example: Desktop/Cygwin). To access Windows files from inside Cygwin, note that relative to Cygwin, all Windows files reside in C:/"Documents and Settings"/<your_user_name>. For example, files on your Desktop can be listed using the Unix "ls" command inside Cygwin as follows:

ls C:/"Documents and Settings"/"Administrator"/Desktop/

(Inside Unix, you must always enclose names in quotes if they contain whitespace.)

Unix organization conventions

The "home" folder contains subfolders whose names match the user accounts on your Windows machine. Each user has full access and modification permissions to his or her /home/<user_name> folder. All other folders can only be written or modified by the administrator. The /bin and /sbin folders contain system programs that are very basic to Unix (such as ls, cd, rm, mkdir, bash, tcsh etc.). The /usr folder hierarchy is used to install programs that should be made available to all users. Widely used standard Unix software such as compilers and language development tools (gcc, Perl, Python) is installed directly inside /usr. Thus, to run Perl from inside Cygwin, you would invoke /usr/bin/perl. There is also a /usr/local folder hierarchy inside the /usr folder, which is normally used to install lesser-known or non-standard software, of which QTL-ALL would be an example. Thus, you may choose to install your genetic analysis software (such as Merlin, SimWalk2 etc.) inside the /usr/local/bin folder.

2. R basic install
R is required to create plots of QTL-ALL p-values. The Windows version of R can be downloaded from the R distribution site. By default, R installs into a folder named R inside the Program Files folder. Each version of R installs into a separate sub-folder inside R; e.g. if you installed version 2.6.1, you should have c:/program Files/R/R-2.6.1. The R executable is named R.exe and resides inside the bin sub-folder of that folder. To make sure that Cygwin knows where to find R, start up Cygwin as administrator and create an alias to this R executable using the following command:

ln -s "c:/program Files/R/R-2.6.1/bin/R.exe" /usr/bin/r

3. R library nplplot
To install add-on R packages, open up R, go to the "Package & Data" menu item, select the Install packages menu item, and use the "install from the web" option to install nplplot.

4. Merlin and/or Simwalk2
QTL-ALL uses Merlin or Simwalk2 to compute identity-by-descent estimates. You will need one or the other of these (probably Merlin) to use QTL-ALL. Both can be downloaded from their respective distribution sites.

To install QTL-ALL from pre-compiled binaries: After downloading the QTL-ALL distribution package, you can install it using the following steps.

1) Uncompress the package using the "gunzip" Unix command and then untar it using the tar command.

gunzip qtlall_v1.0.1_cygwin.tar.gz
tar -xvf qtlall_v1.0.1_cygwin.tar

If you are using the GNU version of the tar command, you can combine the above two steps into a single command:

tar -xzvf qtlall_v1.0.1_cygwin.tar.gz

The "z" tells tar that it is a compressed archive. At this point, you should see a new directory named qtlall_v1.0.1_cygwin. This directory contains the files and sub-directories listed below in alphabetical order:

CHANGES.txt                       List of changes for this version
example                           Directory of example input files
example_output_sibpairs           Directory of example output files
example_output_sibships           Directory of example output files
install.sh                        Bash script for installing perl scripts and binary executables
INSTALL.txt                       File containing installation instructions
LICENSE.txt                       File containing license agreement
REQUIREMENTS.txt                  File listing additional software needed
qtlall_v1.0.1_cygwin.gz           QTL-ALL binary (compressed)
qtlall_log2html.pl.src            Perl script to convert run logs into HTML format
qtlall_output_stats.pl.src        Perl script for producing tables of p-values and statistics
qtlall_pvalue2rplot.pl.src        Perl script to convert QTL-ALL pvalues to nplplot format
qtlall_read_merlin_ibds           Binary executable for reading in Merlin IBD files
qtlall_read_merlin_kmx.pl.src     Perl script for reading in Merlin kinship matrix files
qtlall_read_simwalk2_ibds.pl.src  Perl script for reading SimWalk2 IBD files
qtlall_sibpair_v1.0.1_cygwin.gz   Sibpair statistic computation library (compressed)
qtlall_sibships_v1.0.1_cygwin.gz  Sibship statistic computation library (compressed)
QTLALL User Guide.pdf             This file

NOTE: The Perl scripts have the extension .src because they are missing the path of the Perl interpreter at the top of the script. During installation, this path is added, and the resulting script is then directly executable.

2) Execute the script install.sh by typing

./install.sh

This script checks whether Perl and R are installed, and will fail to install if they are absent. Once these checks have been performed, the install script prompts you to enter a directory to copy the binary executables and scripts into. You must have read and write permissions on this directory. If you want to install in a system directory such as /usr/local/bin, you may need to be logged in as administrator/root, or you may have to use the command sudo ./install.sh.

3) The install script will ask you to enter the full path name of the desired directory, e.g.

Enter folder name for scripts and binaries to be installed: /usr/local/bin

For these scripts to be accessible, the directory you enter must be in your path.

4) For QTL-ALL to work correctly, the install script needs to create aliases (or soft links in Unix lingo) for the statistics routines named qtlall_sibpairs and qtlall_sibships. If you have previously existing files with these names (for example, links to previously installed versions of the qtlall libraries), you will be asked whether these files should be replaced by the new links.

5) If installation proceeded without any errors, QTL-ALL should now be installed and ready to use. To make absolutely sure that the QTL-ALL components are in the right place, use the Unix which command.

    which qtlall

If installation succeeded, you will see something like the following in response to the which command.

    /usr/local/bin/qtlall

If installation was not completed correctly, you will get either an empty message or something to the effect that qtlall was not found. Failure to install may indicate that you do not have adequate access rights to the directory QTL-ALL is being installed into (on Unix, you must have write permission on the directory that you specified above), or that system requirements were not met (e.g. Perl is not installed).

To install QTL-ALL from source:

If you prefer (or need) to compile QTL-ALL on your own machine, you can download the tar-gzipped source directory named qtlall_v1.0.1_src.tar.gz. You will need GNU make and gcc to compile QTL-ALL. To compile and install, type

    ./install.sh

The install script will compile QTL-ALL, and then place all executables and Perl scripts inside the folder you specified at the first prompt. Note that compilation will fail if you do not have GSL (GNU Scientific Library) pre-installed.

2.7 Quickstart: Analysis of example data on Linux, Solaris and Mac

We have provided two sets of example data: one of sib pair data and one of sibship (more than two sibs) data. To run QTL-ALL on the example data we have provided, change your directory into the example folder, and then type

    qtlall

This command runs the QTL-ALL set-up program. You will be presented with the first input menu:

    ==========================================================
    QTL-ALL v1.0.1
    Copyright (c) 2007 The University of Pittsburgh
    Last updated: Jun , 11:48:14
    This version is valid until July 31,
    QTL-ALL comes with ABSOLUTELY NO WARRANTY.
    See LICENSE.txt for terms of copying, modifying & redistributing QTL-ALL.
    ==========================================================
    Moved existing QTLALL.BATCH to QTLALL.BATCH.old
    QTL-ALL input menu:
    ==========================================================
    0) Done with this menu - please proceed
    1) Chromosome number: 1
    2) Input file extension: 01
    3) Locus datafile: _
    4) Pedigree datafile: _
    5) Map datafile: _
    6) Omit datafile (optional): _
    7) Directory for writing output: [ Current directory ]
    8) Warn if allele frequency error measure exceeds: No limit
    Select from options 0-8 >

Select item 2, and type sibships or sibpairs, which will fill in the input file names. Then select item 0 to indicate that you are done with this menu. QTL-ALL then proceeds with the locus and trait selection menus, then output file names and statistics selections. For most of these it suffices to simply type 0, which proceeds to the next menu without making any changes to the current selections (which are initialized to their defaults). In the statistics menu, however, you do have to enter a list of statistic numbers.

You can also run the QTL-ALL set-up program by using one of the batch files, e.g.

    qtlall QTLALL.BATCH.sibpairs

The QTL-ALL set-up program will create an analysis script entitled qtlall.sh, which you can use to perform the analysis by typing

    ./qtlall.sh

If the analysis succeeds, you should see results files qtlall_pvalues.05, qtlall_statistics.05, and qtlall_plots.05.pdf when you open QTLALLrun.html in your browser.
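Before opening the browser, it can be handy to confirm from the command line that the run produced its results files. The following is only a sketch: the function name qtlall_missing_results is hypothetical (it is not part of QTL-ALL), the file names are taken from the example above, and the assumption that QTLALLrun.html sits in the same directory as the results files may not hold if you chose a different output directory.

```sh
# Hypothetical helper, not part of QTL-ALL: report which expected result
# files are missing from a run directory.
#   $1 = directory to look in
#   $2 = file suffix used by the run (e.g. 05 in the example data)
qtlall_missing_results() {
    dir=$1
    suffix=$2
    missing=0
    for f in "qtlall_pvalues.$suffix" \
             "qtlall_statistics.$suffix" \
             "qtlall_plots.$suffix.pdf" \
             "QTLALLrun.html"; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    # Return status 0 when every file exists, 1 otherwise
    return "$missing"
}
```

For the example data this might be called as qtlall_missing_results . 05 after ./qtlall.sh has finished.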

Your own data can be run in exactly the same way. Make sure you are inside the directory that contains your input files, and then follow the instructions above for running QTL-ALL with the example data. For help on setting up your own input files, consult section 2.9. More details on input files are given in Chapter 3.

2.8 Quickstart: Analysis of example data on Windows-Cygwin

a) If you are running QTL-ALL as a "non-administrative" user (say "Joe Schmoe"), copy the example/ folder onto your desktop.

b) From inside Cygwin, cd into the example/ folder.

    > cd c:/"documents and Settings"/"Joe Schmoe"/Desktop/example

c) Check its contents by using the ls command. The example/ folder contains two datasets, a sibship dataset and a sibpair dataset. Sibship data files are named pedin.sibships, datain.sibships, map.sibships and QTLALL.BATCH.sibships. Sibpair data files are named pedin.sibpairs, datain.sibpairs, map.sibpairs and QTLALL.BATCH.sibpairs. README.sibships and README.sibpairs contain more information on these files.

d) Run QTL-ALL by simply issuing the command qtlall:

    qtlall

You will be presented with the first input menu:

    ==========================================================
    QTL-ALL v1.0.1
    Copyright (c) 2007 The University of Pittsburgh
    Last updated: Jun , 11:48:14
    This version is valid until July 31,
    QTL-ALL comes with ABSOLUTELY NO WARRANTY.
    See LICENSE.txt for terms of copying, modifying & redistributing QTL-ALL.
    ==========================================================
    Moved existing QTLALL.BATCH to QTLALL.BATCH.old
    QTL-ALL input menu:
    ==========================================================
    0) Done with this menu - please proceed
    1) Chromosome number: 1
    2) Input file extension: 01
    3) Locus datafile: _
    4) Pedigree datafile: _
    5) Map datafile: _
    6) Omit datafile (optional): _
    7) Directory for writing output: [ Current directory ]
    8) Warn if allele frequency error measure exceeds: No limit
    Select from options 0-8 >

Select item 2, and type sibpairs (or sibships) to set the input file names to pedin.sibpairs, datain.sibpairs, etc. (or pedin.sibships, datain.sibships, etc.). QTL-ALL then proceeds with the locus and trait selection menus, then output file names and statistics selections. For most of these it suffices to simply type 0, which proceeds to the next menu without making any changes to the current selections (which are initialized to their defaults). In the statistics menu, however, you do have to enter a list of statistic numbers.

Alternatively, you can run QTL-ALL using one of the batch files, e.g. (substitute QTLALL.BATCH.sibships for the sibship data):

    qtlall QTLALL.BATCH.sibpairs

This command runs the QTL-ALL set-up program. It will create an analysis script entitled qtlall.sh, which you can use to perform the analysis by typing

    ./qtlall.sh

If the analysis succeeds, you should see results files qtlall_pvalues.05, qtlall_statistics.05, and qtlall_plots.05.pdf when you open QTLALLrun.html in your browser.

Your own data can be run exactly as above. From inside Cygwin, simply change into your data folder, and run qtlall. Section 2.9 below describes briefly how to set up your own input files. For more detailed information on file formats, go to Chapter 3.

2.9 Running QTL-ALL on your own data

The basic steps involved in running QTL-ALL are as follows.

1) Prepare the three necessary input files: pedigree, locus, and map (see section 3.1 or the brief instructions in this section).

2) Execute the QTL-ALL set-up program (command qtlall). This program will take you through the menus to set up your run and will also read in the data and check for errors.

3) Choose your statistic(s) using one of the menus that qtlall shows you. Read section 3.2 (or better yet chapter 4) of the user guide to pick the right statistic. Also be sure to read the section on choosing parameter values.

4) Execute the qtlall.sh script created by the set-up program. This will call Merlin or SimWalk2 to estimate IBDs, and then will compute the statistics you selected and produce output files.

5) The easiest way to find your output files is to use a browser to open the QTLALLrun.html file that is created inside your input data directory. This will display a clickable browser window showing all of your files.

QTL-ALL needs a bare minimum of three data files to run: a LINKAGE-format pedigree file, a locus data file, and a Mega2-format map file. If you have these three files already set up for Mega2, you can use them for QTL-ALL without any modification. You can also use data previously set up for other familiar linkage analysis programs.

The pedigree file can be either in pre-makeped LINKAGE format or standard (post-makeped) LINKAGE format. Just make sure that all traits are quantitative traits. If any of the pedigrees are not nuclear families, QTL-ALL will omit them.

The locus data file can be either the LINKAGE-format locus file as used by GeneHunter, or the simpler QTDT-format names file used by Merlin.

The map file should contain chromosome numbers, locus names and genetic positions of loci in Haldane or Kosambi centimorgans. This file has column headings as below.

    Chromosome Haldane Marker

These headers must be followed by one or more lines specifying the locus positions. If you have a Kosambi map, simply change the header of column 2 to Kosambi. Also, if you have a Merlin map file, simply switch the 2nd and 3rd columns to create the QTL-ALL map file. Note that Merlin map positions use the Haldane map function.
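The column switch described above can be done with a one-line awk filter. This is a minimal sketch under stated assumptions: the Merlin map file is assumed to have exactly one header line followed by whitespace-separated rows of chromosome, marker name, and position; the function name merlin_to_qtlall_map is hypothetical and is not shipped with QTL-ALL. The header is rewritten to the Haldane heading shown above, since Merlin positions use the Haldane map function.

```sh
# Hypothetical helper, not part of QTL-ALL: convert a Merlin map file
# (chromosome, marker, position) into the QTL-ALL column order
# (chromosome, position, marker). Reads the named file(s) or standard input.
merlin_to_qtlall_map() {
    awk '
        # Assumption: the first line is the Merlin header; replace it
        NR == 1 { print "Chromosome", "Haldane", "Marker"; next }
        # Swap the 2nd and 3rd columns on every data row
        NF >= 3 { print $1, $3, $2 }
    ' "$@"
}
```

A typical call would be merlin_to_qtlall_map merlin.map > map.dat, where both file names are placeholders for your own files. It is worth inspecting the result by eye before use, since rows with fewer than three columns are silently dropped by this sketch.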
QTL-ALL uses two non-standard additions to the input files: a selection indicator in the pedigree file, and trait parameters in the locus file. These are not absolutely required for QTL-ALL to run, so you can wait until chapter 3 to read about them. Please note, however, that the choice of the trait parameters is critical to sensible use of the program, so we urge you to read the section on parameter values before you use QTL-ALL for real.

If you give all three input files a common extension (e.g. pedin.dat, names.dat, map.dat), then in the very first input menu (see section 3.1.5), all you need to do is specify the extension (dat) in item 2, and QTL-ALL will automatically fill in the complete file names. Make sure you are inside the directory that contains your input files, and then follow the instructions above for running QTL-ALL with the example data.
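If you rely on a common extension, a quick check that every expected file is present can save a trip through the menus. The sketch below is illustrative only: missing_inputs is a hypothetical function name, and the file prefixes are supplied by the caller rather than assumed, because the prefixes you use may differ from the ones in the example above.

```sh
# Hypothetical helper, not part of QTL-ALL: list the input files that are
# missing or unreadable for a given extension.
#   $1      = common extension (e.g. dat)
#   $2 ...  = file prefixes to check (e.g. pedin names map)
missing_inputs() {
    ext=$1
    shift
    rc=0
    for prefix in "$@"; do
        if [ ! -r "$prefix.$ext" ]; then
            echo "missing: $prefix.$ext"
            rc=1
        fi
    done
    # Return status 0 only when every file is readable
    return "$rc"
}
```

With the example names above, the call would be missing_inputs dat pedin names map, run from the directory that holds your input files.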

2.10 How to report a bug or make a suggestion

We appreciate any comments you have about your experience using QTL-ALL, and especially any suggestions you have for features that you would like us to add in future releases. To make a comment or report a bug, log into our site using your registered e-mail, and then click on the feedback link. A feedback form lets you specify the program (QTL-ALL in our case), the version number (1.0.1), and the feedback type (comment, enhancement or bug report), and then fill in a text report. Click on the submit button at the bottom, and your comment or bug report will be e-mailed to us.

3 Running the Program

The basic steps involved in running QTL-ALL are as follows.

1) Prepare the three necessary input files: pedigree, locus, and map (see section 3.1).

2) Execute the QTL-ALL set-up program (command qtlall). This program will take you through the menus to set up your run and will also read in the data and check for errors.

3) Choose your statistic(s) using one of the menus that qtlall shows you. Read section 3.2 (or better yet chapter 4) of the user guide to pick the right statistic. Also be sure to read the section on choosing parameter values.

4) Execute the qtlall.sh script created by the set-up program. This will call Merlin or SimWalk2 to estimate IBDs, and then will compute the statistics you selected and produce output files.

5) The easiest way to find your output files is to use a browser to open the QTLALLrun.html file that is created inside your input data directory. This will display a clickable browser window showing all of your files.

3.1 Input

QTL-ALL requires three input files: the pedigree, locus and map data files. It also reads an optional omit file, for omitting specific individuals, pedigrees or entire loci from the analysis. The sections below describe the format of each input file in detail.

There are also several text menus that help set up your inputs to the program. The input file menu tells QTL-ALL what the names of the input files are. The locus selection menus allow you to specify which markers and traits from the files to use in the analyses. The missing phenotype value menu allows you to specify a missing-data indicator for the phenotype data. These menus are described in section 3.1.5.

After reading in the input files, QTL-ALL performs a series of checks on the data. Data problems can be classified into three different types: errors that are so serious that QTL-ALL terminates, errors that can be rectified or ignored, and data that are correct by LINKAGE file format standards but do not fall within constraints imposed by QTL-ALL (e.g. extended pedigrees).

Error messages for the first two types of error, and warning messages for the third type, are displayed on the screen and also stored in a log file named QTLALL.ERR. Note that QTL-ALL's Mendelian inheritance checks may not catch all Mendelian errors, as the checks are limited to detecting blatant inconsistencies, such as mismatches between offspring and parental genotypes, or more than 4 unique alleles within a sibship. Thus, it is advisable to run your data through a more rigorous pedigree checking program such as Pedcheck. If the dataset does contain Mendelian inconsistencies that are undetected by QTL-ALL, then subsequent IBD computation by Merlin or SimWalk2 will usually fail on these families and markers.

3.1.1 Locus data file

The locus data file contains information about marker and trait loci. It also determines the order in which genotype and phenotype data are read in from the pedigree file. Two formats are supported by QTL-ALL: 1) names file (similar to Merlin's QTDT format), and 2) LINKAGE format.

Names file: The names file consists of only locus types and names. Four locus types are recognized: autosomal numbered (M), X-linked numbered (X), quantitative traits (T) and covariates (C). If a names file is used, the pedigree file can contain un-coded genotypes, which could be base-pair sizes or non-numeric allele names. This eliminates the need to convert raw genotype data into consecutively numbered alleles, which is required by the LINKAGE format specification. It also eliminates the need to have allele frequencies for each of the chromosomal markers: instead, frequencies are estimated from the pedigree data itself. When a names file is used, data in the pedigree file are read in as follows: each marker genotype is read in as a pair of character strings, and a trait or covariate is read in as a single real value. QTL-ALL will then estimate allele frequencies from the observed data, and recode the marker genotypes as numbered alleles. For example, the following names file specifies that the first phenotype/genotype column in the pedigree data is a quantitative trait named Q1, followed by genotypes at locus M1, and so on.

    :::::::::::::: names.05 ::::::::::::::
    T Q1
    M M1
    M M2
    M M3

LINKAGE format: If your genotype data have already been recoded into consecutively numbered alleles and you wish to use population allele frequencies, then you may choose the more complicated LINKAGE format instead. QTL-ALL requires locus names to be included in addition to the usual information present in a LINKAGE-format file. The standard (but not well-known) LINKAGE format for including locus names is to put a # sign followed by the marker name right after the number of alleles.

    ::::::::::::: datain.05 :::::::::::::
    << NO. OF LOCI, RISK LOCUS, SEXLINKED (IF 1) PROGRAM
    << MUT LOCUS, MUT RATE, HAPLOTYPE FREQUENCIES (IF 1)
    # Q
    << GENE FREQUENCIES
    1 << NO. OF TRAITS
    << GENOTYPE MEANS
    << VARIANCE - COVARIANCE MATRIX
    << MULTIPLIER FOR VARIANCE IN HETEROZYGOTES
    3 2 # M
    << GENE FREQUENCIES
    3 2 # M
    << GENE FREQUENCIES
    3 2 # M
    << GENE FREQUENCIES
    0 0 << SEX DIFFERENCE, INTERFERENCE (IF 1 OR 2)
    << RECOMBINATION VALUES
    << REC VARIED, INCREMENT, FINISHING VALUE

In this setup, the text after each # sign is the locus name: the first locus is named Q1, the second locus is named M1, etc. (It is OK to omit the space between the # sign and the locus name, if desired.)

Specifying trait distribution parameters: Most of the linkage statistics that QTL-ALL calculates require the specification of trait parameters. These parameters are the mean, variance, sibling correlation, skewness, and kurtosis of the trait in the population (not in your sample). (The kurtosis that we use is what is often called "excess kurtosis".) You can specify these parameters in the locus file or in a QTL-ALL menu, or you can have QTL-ALL calculate them from your data if your dataset is a random (unascertained) sample from the population. Please read the section on parameter values for more discussion: the statistics will not perform correctly if you do not think carefully about what parameter values you use.

To specify the parameter values in the locus file, enter 5 additional numbers to the right of the name of a quantitative trait locus (in both formats), denoting the mean, variance, sibling correlation, skewness and kurtosis, in that order. Be sure to use the variance, not the standard deviation. You may use a missing-value identifier (e.g. 99) for any of these parameters that you do not have a value for. This must be the same missing-value identifier that you use for trait data (see section 3.1.5 on the missing phenotype value menu). For example, if you choose to use a missing value indicator of 99, then the fourth line of the example file above specifies for Q1 a mean of 65.0, a variance of 4.0, a sib correlation of 0.1, and no specified values for skewness and kurtosis. Either all 5 values must be given (each as a number or as the missing value indicator), or no values should be given at all. If this is not the case, QTL-ALL will stop with an error message.
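Because a partial parameter list stops QTL-ALL with an error, it can be worth checking a names-format locus file before starting the set-up program. The following is a rough sketch rather than a reimplementation of QTL-ALL's own check: check_trait_params is a hypothetical function name, it handles only the names format (not LINKAGE format), and it assumes that each trait name is a single whitespace-free token, so that a valid T line has either 2 fields (no parameters) or 7 fields (all five parameters).

```sh
# Hypothetical helper, not part of QTL-ALL: flag trait (T) lines in a
# names-format locus file that carry some, but not all five, parameters.
# Reads the named file(s) or standard input.
check_trait_params() {
    awk '
        # A trait line should be "T name" or "T name mean var corr skew kurt"
        $1 == "T" && NF != 2 && NF != 7 {
            printf "line %d: trait %s has %d parameter value(s); expected 0 or 5\n", NR, $2, NF - 2
            bad = 1
        }
        # Exit status 0 when every trait line is acceptable, 1 otherwise
        END { exit (bad + 0) }
    ' "$@"
}
```

For instance, a line such as T Q1 65.0 4.0 0.1 99 99 (using 99 as the missing value indicator, as in the example above) passes, while T Q1 65.0 4.0 is reported.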

Selection indicator: QTL-ALL will offer you different choices of statistics depending on how your data were ascertained. You specify the ascertainment type for each family in the pedigree file using what we call the selection indicator. Since the selection indicator goes into what would normally be a locus/trait column in the pedigree file, it must also be listed in the locus file. To do this using the names-format file, use the following line.

    S selection-status

To do this using the LINKAGE-format file, insert the following locus record:

    -1 0 # selection-status

IMPORTANT: The position of this line with respect to the other locus records must match the position of the selection column with respect to the genotype/phenotype columns inside the pedigree file. Note that this status indicator is currently used only for sibpair studies (but may be supplied for sibship data also). For further discussion of the selection-status variable, see the pedigree data file section below.

3.1.2 Pedigree data file

The pedigree data file contains information about the relationships of the study subjects as well as their phenotype and genotype data. Each line of a pedigree file identifies an individual. The pedigree data file should be in either pre-makeped LINKAGE format or standard (post-makeped) LINKAGE format. The LINKAGE format is the de facto standard for coding pedigree information for a large number of linkage analysis software packages. For a complete description of this format, please see the Handbook of Human Genetic Linkage (Terwilliger and Ott 1994) and the LINKAGE Users Guide.

Pre-makeped format allows for inbreeding loops, while post-makeped format assumes that these loops have been broken. Post-makeped format also includes additional relation fields in the pedigree information columns (first offspring, sibling etc.). In our case, inbreeding is not relevant to the file formats, since QTL-ALL only handles nuclear families.
However, we have allowed for both formats to give the user some flexibility. Below is an example of a pre-makeped file to match the example locus datafile above, with one quantitative trait, Q1, followed by three markers, M1, M2, and M3.

    :::::::::::::: pedin.pre.05 ::::::::::::::
    Id: Id: Id: Id: Id: Id: Id: Id: Id: Id: Id: Id: Id: Id: Id: Id: Id: Id: 3004

As shown, the pre-makeped LINKAGE pedigree file consists of columns of numerical data. The pre-makeped columns are as follows.

    Pedigree Person Father Mother Gender Phenotype1 Phenotype2 Phenotype3 ...

Missing parents are entered as 0 (zero), and, for the gender column, 1 = Male and 2 = Female. (This is easy to remember if you think of the number of X chromosomes.) Makeped inserts some additional columns of pointers (which would be difficult to enter by hand) and breaks loops, which is required by the LINKAGE programs. The columns should be separated by spaces or tabs (it doesn't matter how many).

Below is an example post-makeped file corresponding to the pre-makeped file above.

    :::::::::::::: pedin.05 ::::::::::::::
    Ped: 100 Per: Ped: 100 Per: Ped: 100 Per: Ped: 100 Per: Ped: 100 Per: Ped: 100 Per: Ped: 100 Per: Ped: 100 Per: Ped: 100 Per: Ped: 100 Per: Ped: 100 Per: Ped: 100 Per: Ped: 100 Per: 11

Here, the "Ped: 100 Per: 2" information at the end of each line has been added by the Makeped program. While this information is optional, QTL-ALL can use it to allow the original pedigree IDs and person IDs to be used in the output. However, QTL-ALL can currently only handle numeric pedigree labels (no characters allowed).

Genotype coding: To code a codominant autosomal marker locus phenotype, simply list the two numbered alleles as integers with at least one space or tab between the alleles. The unknown genotype is coded as 0 0. For an X-linked marker locus, males must be coded as homozygotes (as coding them as hemizygotes would break the column alignment), but QTL-ALL will ignore any loci that are listed as being located on the X chromosome.

Quantitative trait coding: Phenotype values for a quantitative trait are real numbers. You can specify the missing-value code using the missing phenotype value menu (section 3.1.5). The missing phenotype value should be a unique number that never occurs as an observed true value. It should be the same value used to specify missing parameter values inside the locus file.

Selection indicator: As discussed above, QTL-ALL needs to know how your pedigrees were ascertained, so that it can offer you an appropriate choice of statistics for your dataset. You specify the ascertainment type for each family in the pedigree file using what we call the selection indicator. Selection indicator values are 0 = population or unspecified sampling, 1 = concordant sibling pair (affected pair), and 2 = discordant sibling pair. We expect to include additional selection indicators (and corresponding statistics) in the future, but since QTL-ALL does not currently include any specialized statistics for phenotypically selected general sibships, the selection status should be set to 0 for all pedigrees with more than two siblings. Since the pedigree file format is a strict table format, each individual has to be assigned a selection type, even though this value applies to the entire family. QTL-ALL checks that all members of a family are assigned the same selection status.

Alternative pedigree and person identifiers: LINKAGE-format pedigree files allow additional identifiers for individuals and pedigrees inside the pedigree file, and these are not restricted to numbers as the first two columns are. However, QTL-ALL can currently only handle numeric pedigree and person identifiers.
A pedigree name can be inserted by adding it to the end of the line, for example:

    Pedigree: 1001

A person identifier is specified similarly, for example:

    Person: 11

In addition, a unique id for each individual (i.e. unique over the entire set of pedigrees) can be specified by using the ID: field, e.g.

    ID:

These identifiers can be used inside the output pedigree and phenotype files.

Pedigrees excluded by QTL-ALL: QTL-ALL excludes any pedigree that is not a nuclear family. Excluded pedigrees include parent-offspring trios, singletons, multi-generation pedigrees, and nuclear families with half-siblings. The exclusion process takes place after the input data have been read in, and these pedigrees are simply ignored for the rest of the analysis. Pedigrees with fewer than two genotyped offspring or fewer than two phenotyped offspring are also excluded. (Note, however, that some individuals not included in the analysis are still used in the computation of parameter values; see section 3.2.2.)

3.1.3 Map data file

The map file gives the (relative) map position of each marker in centimorgans (cM). If two markers fall at exactly the same position, then QTL-ALL will assume that the marker listed first should come first, and will automatically add a small increment (of cM) to the position of the second marker during analysis. QTL-ALL distinguishes between Haldane and Kosambi map distances by looking at the first line of the map file. If the second column heading is "Kosambi", then the distances are read in as Kosambi centimorgans. When computing IBDs with SimWalk2, the appropriate mapping function is used to convert the inter-marker distances into recombination fractions. Example:

    :::::::::::::: map.05 ::::::::::::::
    CHROMOSOME KOSAMBI NAME
    M M M2

NOTE: Any marker that is in the locus file must be given a map position in the map file. Thus, the marker names used in the map file must match exactly the names used in the locus file. If a codominant marker locus in the locus file is not found in the map file, then QTL-ALL will warn you about this. If you ignore the warning, then this locus will not be included in the analysis. If you have more loci in the map file than appear in the locus file you will be warned, but this does not pose any difficulties.

HINT: This feature can be used to easily exclude a marker from all analyses: simply alter the name of the marker in the map file so that it no longer matches, or remove the marker from the map file.

3.1.4 Omit file

The omit file specifies a list of (pedigree, person, marker) sets that are to be omitted from analysis. QTL-ALL sets the genotype or phenotype indicated by each set to unknown immediately after reading in the data. A list of omitted genotypes and phenotypes is logged inside the QTLALL.LOG file. These genotypes are excluded from allele-frequency estimation, genotyping rate summaries, Mendelian error checking, and IBD computation. The omit file can be useful for flagging and omitting problem genotypes (e.g. those identified by Pedcheck as potential errors) without actually changing your pedigree file.

Here is an example omit file corresponding to the above pedigree file.

    1 0 M1
    2 2 All

The first column is the pedigree number, the second column contains either a specific individual id or 0, which refers to all individuals within that pedigree, and the third column contains a specific marker name or All, which refers to all markers. In this example, all genotypes for marker M1 are to be set to unknown for pedigree 1, and all genotypes of individual 2 of pedigree 2 are to be set to unknown. To omit the entire pedigree 1 from analysis, use the line

    1 0 All

3.1.5 Menus for input data handling

Input file menu: This is the starting menu of QTL-ALL. You use it to tell QTL-ALL what the input file names are. The menu will initially look like this:

    =========================================================
    QTLALL input menu:
    =========================================================
    0) Done with this menu - please proceed
    1) Chromosome number: 1
    2) Input file extension: 01
    3) Locus datafile: _
    4) Pedigree datafile: _
    5) Map datafile: _
    6) Omit datafile (optional): _
    7) Directory for writing output: [ Current directory ]
    8) Warn if allele frequency error measure exceeds: No limit
    9) Missing phenotype and parameter indicator:
    Select from options 0-9 >

The simplest way to name your files is to give them a common extension as follows.

    datain.<ext>
    pedin.<ext>
    map.<ext>
    omit.<ext>

Then, when the input menu appears, simply select option 2, and then specify the extension at the prompt. Often the extension will be a chromosome number (e.g. 01). If you name your files this way, then you can just specify the chromosome in item 1 of the input file menu and QTL-ALL will assume the rest. Note that for chromosomes 1-9 you probably want to use extensions 01-09 so as to force the lexical order of these file names to follow the numerical order. Note also, however, that you do not need to have separate files for separate chromosomes. It's fine to combine markers from different chromosomes into the same input files.

For example, to use the files datain.10, pedin.10, and map.10 in your analysis, select item 1 when the input file menu above is displayed, and type 10 at the prompt. The menu display will then reappear with the file names filled in as follows.

    =========================================================
    QTLALL input menu:
    =========================================================
    0) Done with this menu - please proceed
    1) Chromosome number: 10
    2) Input file extension: 10
    3) Locus datafile: datain.10
    4) Pedigree datafile: pedin.10
    5) Map datafile: map.10
    6) Omit datafile (optional): _
    7) Directory for writing output: [ Current directory ]
    8) Warn if allele frequency error measure exceeds: No limit
    9) Missing phenotype and parameter indicator:
    Select from options 0-9 >

Or (for example), if you now realize that your locus file is actually called locus.10, you can enter 3 and then locus.10 at the prompt to change the name for that file. Choices made in items 3-6 override anything specified in item 2.

The omit file is optional. If an omit file is present in the directory you are working in and is named omit.<ext>, it will be listed in the menu next to option 6. If you do not want to use that omit file, you must enter 6 and then type clear.

The only other item you may need to change here (once your file extensions are verified) is the missing quantitative phenotype indicator value. This value indicates unknown phenotypes for individuals' quantitative traits inside the pedigree file, as well as unknown distribution parameters inside the locus data file. It is initially set to our default. If your missing phenotype values are something else, you need to enter that value here. Enter 9 at the menu prompt. QTL-ALL will now prompt you for the value:

    ==========================================================
    Enter missing quant value >

If your missing value indicator is -9, for example, enter -9 here, and hit return.

    ==========================================================
    Enter missing quant value > -9

36 You will be taken back to the input menu again, and now the new missing value indicator will be displayed. You can now enter 0 to continue with the analysis. Marker selection menu: This menu allows you to select a subset of the markers for use in the analysis. The menu will initially look like this. Marker Locus Selection Menu 0) Done with this menu - please proceed *1) Select all loci in map order on chromosome: 5 2) Select by locus number. 3) Select loci on multiple chromosomes[] Enter 0-3 > The asterisk next to item 1 in the menu indicates that the default selection of all markers has been chosen. To change that, type 2 or 3 at this menu to see options for selecting a subset of markers or multiple chromosomes. For example, if you select item 2, you will see the following. ========================================================== Select marker loci on chromosome 5 Select a set of loci by entering their numbers in the desired order. Separate each number by a space. Enter 'v' to view the original list of loci and their numbers. Enter 'm' to select all marker loci on current chromosome. Enter 'o' to list the ordered loci you have selected. Enter 'e' to terminate the selection process. Enter a set of loci numbers (enter 'e' to terminate selection): > Now if you enter v, the original list of markers is displayed. Original list of loci Name Position Type 1 D5G Numbered 2 D5G Numbered 3 D5G Numbered 4 D5G Numbered 5 D5G Numbered 6 Q1_MRK Numbered 7 D5G Numbered 8 D5G Numbered 9 D5G Numbered 10 D5G Numbered 11 D5G Numbered 36

Enter a set of loci numbers (enter 'e' to terminate selection): >

To choose (for example) loci 1, 2, 6, 10, and 11, you would enter the following at the prompt.

Enter a set of loci numbers (enter 'e' to terminate selection): > 1 2 6 10 11 e

This would display the selected loci and then repeat the locus reorder menu as below.

Selected loci (in final order)
Locus name Map position Locus Type
D5G Numbered
D5G Numbered
Q1_MRK Numbered
D5G Numbered
D5G Numbered

Locus Reordering Menu
0) Done with this menu - please proceed.
1) Select all loci in map order on chromosome: 5
*2) Select by locus number.
3) Select loci on multiple chromosomes[5 ]
Enter 0-3 >

Note that the asterisk is now on item 2, indicating that this is the current selection. If you now enter 0 to proceed, the final selections are displayed.

Selected loci (in order)
Locus name Map position Locus Type
D5G Numbered
D5G Numbered
Q1_MRK Numbered
D5G Numbered
D5G Numbered

These are the 5 markers that will be analyzed by QTL-ALL.

Trait selection menu: This menu is analogous to the marker selection menu, but it allows you to specify which traits to include in the analysis. It also allows you to choose whether or not to combine all traits into a single output file. The default choices are shown in the menu in square brackets.

Quantitative trait selection menu:
0) Done with this menu - please proceed.
1) Create trait-specific files or combine: [Trait-specific]

2) Trait loci selected: [AntiCCP]
Enter option 0-2 >

If you type 0 to continue, your analysis will include only the trait AntiCCP and there will be one output file for each trait. If you type 2 instead, QTL-ALL will display a list of traits for you to select from. Typing 1 will toggle between trait-specific and combined output modes.

3.2 Choice of statistics to calculate

QTL-ALL computes a large number of different QTL linkage statistics, in particular a large number of variants of the score statistic. Most statistics are applicable to any pedigree that QTL-ALL allows, but some are only for sib pairs. We provide a lot of statistic choices for the convenience of users who have strong preferences of their own and users who want to be able to compare statistics. However, we do not recommend p-value fishing by asking QTL-ALL to compute many statistics and then choosing the one with the nicest results. That strategy may get you published, but it won't get you into heaven (assuming, of course, that God is a statistician). Rather, we recommend making an a priori choice of the statistic that is best for your study. In most cases that will be SCORE.MAX or SCORE.CT. Sections 3.2.4 and 3.2.5 give summary recommendations on statistic choice, and the gory details on the theory and all our previous simulation results comparing the statistics are in chapter 4.

Note that while we have included the MERLIN-REGRESS statistic in QTL-ALL (under the name SCORE.MERLIN), we really recommend that you use the MERLIN-REGRESS software itself if you want that statistic. Due to computational limitations, QTL-ALL only computes this statistic for small pedigrees (6 or fewer offspring) and only if the offspring are fully phenotyped.
However, SCORE.CT is very similar, and is much faster than either our implementation of MERLIN-REGRESS or the original, and SCORE.MAX is equally fast and a bit more powerful.

3.2.1 Statistic selection menu

The statistic selection menu gives you a list of statistics to choose from, along with brief recommendations on which statistic to use for what type of dataset. Choose the statistics you want by specifying a list of numbers at the prompt. The order in which you list the statistics will be the order in which they appear in the output. Note that this is a single-shot menu, with no way to change selections once they are made.
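
The restriction on SCORE.MERLIN stated in section 3.2 (6 or fewer offspring per pedigree, all offspring phenotyped) can be checked on your own data before you start a run. The Python sketch below is only an illustration of that rule; the function name and data layout are hypothetical, and QTL-ALL's own internal check is not documented here.

```python
from typing import Optional, Sequence

MAX_OFFSPRING = 6  # limit stated in this guide for SCORE.MERLIN and HM.MERLIN


def merlin_statistics_available(
        sibships: Sequence[Sequence[Optional[float]]]) -> bool:
    """Return True if every sibship is small enough and fully phenotyped.

    Each sibship is a sequence of offspring phenotypes; None marks a
    missing phenotype (hypothetical layout, for illustration only).
    """
    for offspring in sibships:
        if len(offspring) > MAX_OFFSPRING:
            return False
        if any(value is None for value in offspring):
            return False
    return True


# Two small, fully phenotyped sibships: the MERLIN-type statistics qualify.
print(merlin_statistics_available([[1.2, 0.7], [0.1, -0.4, 2.2]]))
```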

The list of statistics shown in the menu will depend on the pedigrees and selection types in the dataset. For example, discordant pair statistics are only available if the dataset consists entirely of discordant pairs (sibships of size two with a selection indicator of 2). The example menu below shows all of the statistics that are available for general sibships, except that the statistics SCORE.MERLIN and HM.MERLIN are omitted because the dataset contains sibships of size greater than six.

Statistic selection menu
=========================================================
SCORE.MERLIN and HM.MERLIN unavailable (data contains sibships larger than 6),
Statistics available for sibships (see QTL-ALL User Guide for more detail).

Recommended statistics:
1) SCORE.MAX (maximum of SCORE.CT and SCORE.EV - correct type I error and robust power - recommended)
2) SCORE.CT (score statistic conditional on trait - slightly more conservative than SCORE.MAX)
3) HM.CT (higher moment version of SCORE.CT - recommended for highly non-gaussian traits if good estimates of trait distribution moments are available)
4) SCORE.2DF.CT (2 degree of freedom version of SCORE.CT - usually lower power than SCORE.CT except for models with very high dominance variance)
5) HM.2DF.CT (2 degree of freedom version of HM.CT)

Statistics that are not recommended under most circumstances:
6) SCORE.EV (score statistic with empirical variance - conservative and underpowered in most cases)
7) SCORE.NAIVE (score statistic with naive variance estimate - incorrect type I error for selected or non-gaussian data)
8) SCORE.CIBD (score statistic conditional on IBD - incorrect type I error for selected data)
9) HM.NAIVE (higher moment version of SCORE.NAIVE - incorrect type I error for selected data)
=========================================================
Enter string of statistic numbers ('e' to terminate) >

3.2.2 Parameter values

A critical aspect of the score statistics is that they require the user to specify the
parameters of the trait distribution: mean, variance, and sib pair correlation. The higher moment statistics also use the skewness and kurtosis (the kurtosis that we use is what is often called "excess kurtosis"). All of these parameter values must be the population values, not the values in your sample. If your sample is unselected (unascertained), the sample values are probably a good approximation to the population values. But if you have a selected sample (e.g. with extreme probands in each family), using the sample parameter estimates will result in very poorly-performing (low-power!)

statistics. Therefore it is critical that you use the best possible population trait parameter estimates in QTL-ALL. (Note that this issue is not unique to QTL-ALL; it is also relevant to MERLIN-REGRESS and to some of the Haseman-Elston statistics implemented in S.A.G.E., for example.) For simulation results showing the performance of the score statistics with poor parameter estimates, see T.Cuenco et al. 2003, Szatkiewicz et al. 2003, and Bhattacharjee et al. submitted.

There are two ways to specify the trait parameters in QTL-ALL. You can tell QTL-ALL exactly what values to use in the locus file (see section 3.1.1). Or, if you have an unselected sample, you can let QTL-ALL calculate parameter estimates from your sample. You can also use some specified values and some calculated (estimated) values. The phenotype distribution parameter menus let you control this. If you don't plan to use the higher-moment statistics, it doesn't matter what you use for the skewness and kurtosis values.

When the phenotype distribution parameter menu is first displayed, it will show you which parameter values you already entered in the locus file and which ones QTL-ALL is planning to calculate. In the example below, no values were entered in the locus file, so each field shows CALC, indicating that QTL-ALL will calculate the value from your data.

QTL parameter menu (CALC stands for values to be estimated by QTLALL)
0) Done with this menu - please proceed.
   Locus  Mean  Variance  Sib-sib corr  Skewness  Kurtosis
1) Q1     CALC  CALC      CALC          CALC      CALC
2) Q2     CALC  CALC      CALC          CALC      CALC
3) Q3     CALC  CALC      CALC          CALC      CALC
4) Q4     CALC  CALC      CALC          CALC      CALC
Enter QTL number to change parameter value, 0 to continue >

If you select 0 at this point, all parameter values will be estimated from your data. If you select one of the traits instead (e.g. 2), you will have the option to input parameter values for that trait.
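
To make the trait parameters concrete, the Python sketch below computes the mean, variance, skewness, and excess kurtosis of a list of phenotype values. It is an illustration only: the function name is hypothetical, it uses simple divide-by-n moment estimators as an assumption (QTL-ALL's own estimators may differ), and it omits the sib-sib correlation. Remember that values computed from a selected sample are not suitable substitutes for population values.

```python
from typing import Dict, Sequence


def trait_moments(values: Sequence[float]) -> Dict[str, float]:
    """Moment-based summaries of a trait (divide-by-n estimators, assumed)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two phenotype values")
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("trait is constant; skewness and kurtosis are undefined")
    return {
        "mean": mean,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,
        # Excess kurtosis subtracts 3, the kurtosis of a normal distribution.
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }


print(trait_moments([1.0, 2.0, 3.0, 4.0, 5.0]))
```

With this convention a normally distributed trait has skewness and excess kurtosis both close to 0, which is why values far from 0 suggest considering the higher moment statistics.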
For example, entering 2 at the prompt displays the menu for trait Q2:

Enter QTL number to change parameter value, 0 to continue > 2
Trait Q2 :
0) Done with this menu - proceed.
1) Calculate all or use input values [CALC]
2) Mean : [CALC]
3) Variance : [CALC]
4) Correlation : [CALC]
5) Skewness : [CALC]
6) Kurtosis : [CALC]
Enter option number 0-6 >

The first item toggles between CALC and INPUT. In the CALC mode, parameters are estimated from data irrespective of whether they were input or not. In the INPUT mode, user-specified values are honored, if they were provided in the locus file, or by a previous iteration of

this menu. For example, if you only specified the mean and variance in the locus file, and the other three values were set to unknown, then this menu would look like this.

Enter option number 0-6 > 1
Trait Q2 :
0) Done with this menu - proceed.
1) Calculate all or use input values [INPUT]
2) Mean : [0.0]
3) Variance : [1.0]
4) Correlation : [CALC]
5) Skewness : [CALC]
6) Kurtosis : [CALC]
Enter option number 0-6 >

You could now select the remaining parameters one by one to specify correlation, skewness and kurtosis values.

If QTL-ALL estimates the parameter values, it uses all phenotyped offspring of nuclear families. Note that this may include some pedigrees that are not included in the actual linkage analysis, such as parent-child trios or families in which several children have phenotypes but only one has a genotype. Parental phenotypes, if present, are not used in the parameter estimation in order to avoid possible cohort effects.

3.2.3 IBD computation

QTL-ALL calls external software to compute identity-by-descent (IBD) estimates. Currently QTL-ALL can use either SimWalk2 or Merlin as the IBD computation engine. The choice is made in the IBD selection menu.

Both Merlin and SimWalk2 allow estimation of IBDs at one or more points that lie between actual marker loci positions (as specified in the map file). SimWalk2 allows each interval to be divided into a specific number of sub-intervals. Merlin also allows this type of stepwise IBD computation, as well as a grid, in which the step size is specified in centimorgans. QTL-ALL does not currently handle the grid option. The number of intervals is also specified via the IBD computation menu.

0) Done with this menu - please proceed
1) Number of steps between markers: 3
2) IBD program name: Merlin
Select option 0-2 > 2

The number of inter-marker intervals is set to 3 by default. To change this, select option 1, and enter the number desired.
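
To see what the "number of steps between markers" setting implies, the Python sketch below lists the map positions at which IBD would be estimated if each inter-marker interval were divided into equal sub-intervals. This is an assumption made for illustration: the function name and marker positions are hypothetical, and the exact placement of points used by Merlin or SimWalk2 may differ.

```python
from typing import List, Sequence


def ibd_positions(marker_positions_cm: Sequence[float],
                  steps: int = 3) -> List[float]:
    """Marker positions plus equally spaced points inside each interval.

    Assumes each interval between adjacent markers is split into `steps`
    equal sub-intervals (an illustrative reading of the menu option).
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    positions: List[float] = []
    for left, right in zip(marker_positions_cm, marker_positions_cm[1:]):
        width = (right - left) / steps
        # Left marker followed by the interior points of this interval.
        positions.extend(left + k * width for k in range(steps))
    if marker_positions_cm:
        positions.append(marker_positions_cm[-1])  # final marker
    return positions


# Hypothetical map with markers at 0, 6 and 15 cM.
print(ibd_positions([0.0, 6.0, 15.0], steps=3))
```

Larger step counts give smoother plots of the statistics along the chromosome at the cost of more IBD computation.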

Merlin is the default IBD computation engine. To change this, select option 2, then enter the program number from the list (which currently has only 2 items: (1) SimWalk2 and (2) Merlin). Support for SimWalk2 is provided in anticipation of extending QTL-ALL to handle larger pedigrees in the near future. However, since QTL-ALL currently only supports nuclear families, in most cases one would be best off choosing to run Merlin instead of the slower approximate SimWalk2. In addition, the SCORE.MERLIN and HM.MERLIN statistics require kinship matrices, which only Merlin can estimate. Therefore, if either of these statistics is selected, Merlin is automatically chosen as the IBD program, and this menu only lets you change the number of inter-marker intervals.

Merlin options: Merlin requires the input file names to be provided as command-line options, as well as other analysis parameters, such as the type of analysis to be conducted, limits to the memory size for each pedigree, etc. QTL-ALL calls Merlin with the set of options needed to compute IBDs. These include the pedigree, locus, map and frequency file names, the --ibd option as the analysis, the --trim option to trim completely ungenotyped offspring, and the --steps option for the inter-marker points. If you want QTL-ALL to use different Merlin options, you can edit the qtlall.sh script before running it, though we do not recommend doing this unless you are confident that you know what you are doing (see section 3.4.3).

SimWalk2 options: SimWalk2 options are provided within the SimWalk2 batch file. QTL-ALL sets up the appropriate batch file to (a) perform IBD estimation, and (b) produce IBD estimates at the required number of inter-marker points. The other options are set to default values as recommended by SimWalk2's authors.

3.2.4 Which statistic should you use for sibships?

Sibship data means data from two-generation families in which all sibs are full sibs, i.e. all pedigrees that QTL-ALL will currently analyze.
This is distinguished from sib pair data, in which there are only two offspring per family. All the statistics for sibship data in QTL-ALL are score statistics, and all of them treat the entire nuclear family as a unit. Parental phenotypes are ignored, but genotypes are used in the IBD computations. Most of these statistics are also in theory applicable to arbitrary pedigrees, but that capability is not currently implemented in QTL-ALL.

Basically, SCORE.MAX is the recommended statistic to use in most situations, although it has slightly inflated type I error for small samples. SCORE.CT has slightly lower power than SCORE.MAX, but has better type I error for small samples. Both SCORE.CT and SCORE.MAX are score statistics with empirical denominators that make them robust to both non-gaussian trait distributions and selected sampling. Again, we do not recommend using SCORE.MERLIN except for very small sibships, due to speed considerations; use the MERLIN-REGRESS software instead if that's the statistic you want. QTL-ALL will not currently let you use

SCORE.MERLIN or HM.MERLIN unless all your pedigrees have 6 or fewer offspring and there are no missing phenotypes.

If you know that the gene you are trying to find acts dominantly or recessively (i.e. has a high dominance variance), consider a two degree of freedom statistic instead of SCORE.CT or SCORE.MAX.

If your trait distribution (in the whole population) is highly non-gaussian (non-normal), consider a higher moment statistic. But if you have a phenotypically-selected sample and you are not sure you have a good estimate of what the trait distribution looks like in the underlying (unselected) population, it's best to stick with SCORE.MAX or SCORE.CT. In particular, a higher moment statistic is only going to help you out with a non-normal trait if you have good estimates of the higher moments (skewness and kurtosis).

The following flow chart spells this out a bit more algorithmically, and more detail is given in chapter 4.

[Flow chart: choosing a statistic for sibship data]

How were the families selected?
  Population sample: go to "What does the trait distribution look like?"
  Phenotypically selected sample: Are good estimates of population (not sample) trait parameters available? Are you sure about those parameter estimates?
    Yes. Yes: go to "What does the trait distribution look like?"
    No: SCORE.MAX or SCORE.CT

What does the trait distribution look like?
  Highly non-normal: Do you have reason to believe that the putative gene has high dominance variance?
    Definitely: HM.2DF.CT
    Who knows?: HM.CT
  Reasonably normal: Do you have reason to believe that the putative gene has high dominance variance?
    Definitely: SCORE.2DF.CT
    Who knows?: SCORE.MAX or SCORE.CT
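
The recommendations for sibship data in this section can also be written as a small decision function. The Python sketch below reflects one reading of the flow chart and the accompanying text; the function and argument names are hypothetical, and it is not part of QTL-ALL.

```python
def recommend_sibship_statistic(selected_sample: bool,
                                good_population_estimates: bool,
                                highly_non_normal: bool,
                                high_dominance_variance: bool) -> str:
    """Suggest a sibship statistic following the guide's recommendations."""
    # Selected sample without trustworthy population parameters:
    # stay with the robust default statistics.
    if selected_sample and not good_population_estimates:
        return "SCORE.MAX or SCORE.CT"
    if highly_non_normal:
        # Higher moment statistics for clearly non-normal traits.
        return "HM.2DF.CT" if high_dominance_variance else "HM.CT"
    # Reasonably normal trait.
    return "SCORE.2DF.CT" if high_dominance_variance else "SCORE.MAX or SCORE.CT"


# Population sample, non-normal trait, no prior belief about dominance.
print(recommend_sibship_statistic(False, True, True, False))
```

Making the choice this way before looking at any results is in keeping with the advice in section 3.2 to pick a statistic a priori.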

3.2.5 Which statistic should you use for sib pairs?

Sib pair statistics are for datasets that consist exclusively of nuclear families with two children. All of the sibship score statistics mentioned above can be used for sib pair data, but there are also a number of special statistics for sib pairs. In general, for many common sib pair study designs (see below), there are special sib-pair statistics that have similar power to the score statistic but are more robust to non-normality and to misspecification of population trait parameters.

Discordant sibling pairs: A score statistic can be used (chosen according to the flow chart above), but the standard IBD-sharing statistic (called IBD in QTL-ALL) has very high power for extreme discordant pairs, and its power is much more robust to non-normality than that of the score statistics (Szatkiewicz et al. 2003). Moreover, the IBD-sharing statistic does not use estimates of population trait parameters at all, so its power does not suffer if you don't have good estimates of those. If the sib pairs are only moderately discordant, the IBD statistic will have poor power, but the Robust Discordant Pair (RDP) statistic has the same robustness properties (Szatkiewicz and Feingold 2004).

Concordant sibling pairs: We recommend the IBD statistic for extreme concordant pairs, but if the concordance is only moderate the score statistic is preferred (Szatkiewicz et al. 2003). This means that in most cases the score statistic will be best for analyzing a quantitative trait measured in affected sib pairs, unless the quantitative trait being mapped is very closely correlated with the binary disease trait that was used to select the families.

Mixed discordant and concordant sibling pairs: The GU.EMPIRICAL statistic is an IBD-sharing statistic for studies that consist of a mixture of concordant and discordant pairs.
It does not require estimates of population trait parameters, but its power relative to the score statistic is very difficult to predict (Szatkiewicz and Feingold 2005). We recommend the score statistic for mixed discordant and concordant data.

Population samples and singly-ascertained samples: Singly-ascertained samples are those in which the family is ascertained based on the phenotype of only one member of the pedigree, as opposed to concordant or discordant pairs, in which two phenotypes are used for ascertainment. For population or singly-ascertained samples, the XU statistic often has higher power than the score statistic (T.Cuenco et al. 2003), especially if the trait distribution is non-gaussian and/or the trait parameters are not well estimated.

3.3 Output

When your QTL-ALL run is complete (set-up program and qtlall.sh script), you should open the file qtlallrun.html in your web browser. This will give you a clickable display of results files and run logs/diagnostics, which are described individually below. The browser window will look something like this.

3.3.1 Results files

There are three basic results files: STATISTICS, PVALUES, and PLOTS. The STATISTICS and PVALUES files will exist for each chromosome you analyzed and, if you requested it in the trait selection menu, for each trait separately, in the trait-specific
