Trait Analysis by association, Evolution and Linkage (TASSEL) User Manual

Size: px

Start display at page:

Download "Trait Analysis by association, Evolution and Linkage (TASSEL) User Manual"

Penelope Lee
5 years ago
Views:

1 Trait Analysis by association, Evolution and Linkage (TASSEL) User Manual

Testing and Validation of TASSEL were performed by the Buckler Lab at North Carolina State University and Cornell

pertinent person in the current programming team. General Information Ed Buckler (Project leader) esb33@cornell.

edu Zhiwu Zhang zz19@cornell.edu Acknowledgement of contributors: Yogesh Ramdoss, Michael E. Oak, and Karin J. Holmberg.

This project is supported by the USDA-ARS and the National Science Foundation. Open source code: http://sourceforge.

2 Testing and Validation of TASSEL were performed by the Buckler Lab at North Carolina State University and Cornell University Last update: March 13, 2009 For more quick and precise answers, please address your questions to the most pertinent person in the current programming team. General Information Ed Buckler (Project leader) Data import, GDPC, Pipeline Terry Casstevens Data analysis by GLM/MLM Peter Bradbury Zhiwu Zhang Acknowledgement of contributors: Yogesh Ramdoss, Michael E. Oak, and Karin J. Holmberg. The technical editing was provided by N. Stevens for this manuscript. This project is supported by the USDA-ARS and the National Science Foundation. Open source code: Extensive use of the PAL library is made ( Database access is achieved by GDPC middleware ( ii

3 Table of Contents INTRODUCTION 1 1 GETTING START WEB START STAND-ALONE OPEN SOURCE CODE MENU FILE TOOLS PANELS METHOD BUTTONS 6 2 DATA MODE GDPC FILE SEQUENCE ALIGNMENT POLYMORPHISM ALIGNMENT ANNOTATED ALIGNMENT NUMERICAL TRAIT OR COVARIATE SQUARE NUMERICAL MATRIX SITES TAXA GENOTYPE CONVERTER TRANSFORM, STANDARDIZE AND IMPUTE DATA GENOTYPE NUMERICALIZATION TRANSFORM AND/OR STANDARDIZE DATA IMPUTING MISSING DATA PCA SYNONYMIZE TAXA NAMES UNION JOIN INTERSECTION JOIN 22 3 ANALYSIS MODE DIVERSITY 24 iii

4 3.2 LINKAGE DISEQUILIBRIUM CLADOGRAM KINSHIP MATRIX GENERAL LINEAR MODEL MIXED LINEAR MODEL STRUCTURED ASSOCIATION TEST MANUAL SNP EXTRACTOR BULK SNP EXTRACTION 35 4 RESULTS MODE TABLE LD PLOT TREE PLOT CHART D PLOT GENE PLOT VISUAL GENOTYPE HAPLOTYPE VIEWER 43 5 TUTORIAL IMPORTING DATA FROM FLAT FILES SNP DATA PHENOTYPE DATA POPULATION STRUCTURE DATA KINSHIP DATA ANNOTATED ALIGNMENT IMPORTING DATA FROM A DATABASE (VIA GDPC) CONNECTING WITH A DATABASE DATA QUERY IMPORTING GDPC DATA INTO TASSEL SAVING GDPC QUERY RESULTS IMPORTING/EXPORTING DATA TO/FROM A DATA TREE DATA MANIPULATION FILTERING DATA JOINING DATASETS 55 iv

5 5.4.3 MISSING VALUE IMPUTATION PRINCIPAL COMPONENT ANALYSIS (PCA) USING PRINCIPAL COMPONENTS AS STRUCTURE 59 ASSOCIATION ANALYSIS USING GLM ASSOCIATION ANALYSIS USING MLM WORKING WITH GENOTYPE DATA DATA VISUALIZATION D PLOT XY PLOT LD PLOT TREE PLOT 76 6 ASSOCIATION CASE STUDIES CASE 1: PHENOTYPE + CANDIDATE SNP MARKERS (HAPLOID) CASE 2: PHENOTYPE + CANDIDATE SNP MARKERS (HAPLOID) + Q CASE 3: PHENOTYPE + CANDIDATE SNP MARKERS (HAPLOID) + Q +K CASE 4: PHENOTYPE + CANDIDATE SNP MARKERS (HAPLOID) WITH A SET OF RANDOM MARKERS CASE 5: PHENOTYPE + CANDIDATE SSR MARKERS (HAPLOID) CASE 6: PHENOTYPE + CANDIDATE SNP MARKERS (DIPLOID) PRESENTATION OF ASSOCIATION RESULT 81 7 APPENDIX NUCLEOTIDE CODES ASSIGNED BY IUB AS USED IN THE BULK SNP METHOD TASSEL TUTORIAL DATASET BIOGRAPHY OF TASSEL FREQUENTLY ASKED QUESTIONS IN GENERAL WHAT DO I DO IF TASSEL MISBEHAVES? HOW DO I CHANGE THE AMOUNT OF MEMORY USED? WHAT DO I DO WHEN THE EXCEPTION JAVA.LANG.OUTOFMEMORYERROR APPEARS? HOW DO I JOIN THE FUN: TASSEL ON SOURCEFORGE? WHERE DO I TURN FOR MORE INFORMATION? WHEN I CLICK ON THE MOST CURRENT VERSION OF TASSEL WEB START, A PREVIOUS VERSION APPEARS. WHAT SHOULD I DO? WHAT SHOULD I SUBSTITUTE FOR MISSING VALUES IN TASSEL? I HAVE 30 MARKER LOCI (295 GERMPLASM LINES) BUT EACH LOCI HAS 2 ALLELES (DIFFERING IN SIZE). IT SEEMS THAT TASSEL CONSIDERS THE TWO ALLELES AS DIFFERENT LOCI, THUS ACCEPTING 60 LOCI INSTEAD OF 30. IS THERE A WAY TO FIX THIS? IS IT POSSIBLE TO CHANGE DATA NAMES IN THE DATA TREE? HOW TO CREATE TAASEL ICON ON DESKTOP? FREQUENTLY ASKED QUESTIONS ON GLM/ML WHY DO I GET EMPTY SQUARES IN MLM ASSOCIATION ANALYSIS? HOW TO SOLVE NON-CONVERGE PROBLEM? WHY SHOULD I EXCLUDE ONE COLUMN OF THE POPULATION STRUCTURE? I RAN BOTH SAS AND TASSEL ON THE SAME DATA AND GOT RESULTS FROM SAS BUT TASSEL DID NOT OUTPUT AN ANSWER. WHY? IF I HAVE KINSHIP DATA, IS POPULATION STRUCTURE STILL NEEDED WHEN USING THE MLM? OR IS ONLY KINSHIP, TRAIT AND GENOTYPE DATA NEEDED? MY POPULATION CONSISTS OF OUTBRED DIPLOID INDIVIDUALS. THE MATRIX THAT I GENERATED USING THE KIN FUNCTION IN TASSEL WAS DRASTICALLY DIFFERENT FROM THE KINSHIP COEFFICIENT MATRIX THAT I GENERATED USING SPAGEDI (USING THE LOISELLE '95 KINSHIP COEFFICIENT). WHY? 87 v

6 7. HOW CAN I GET R SQUARE FROM SAS PROC MIXED? HOW R SQUARE ARE CALCULATED FOR MODEL AND MARKER IN MLM? HOW HERITABILITY IS DEFINED? DOES MLM FIND MORE ASSOCIATION THAN GLM?? DO I NEED MULTIPLE TEST CORRECTION FOR THE P VALUE FROM TASSEL? HOW TO CREATE KINSHIP MATRIX? REFERENCES 89 vi

7 INTRODUCTION With advances in genotyping technology, including rapid increases in the number of genetic markers available for QTL studies, association analysis is now a viable approach for the dissection of complex genetic traits (CHURCHILL et al. 2004). Trait Analysis by Association, Evolution and Linkage, or TASSEL, makes use of the most advanced statistical methods to maximize statistical power for finding QTL. Both a structured association approach (PRITCHARD et al. 2000; THORNSBERRY et al. 2001) and a unified mixed model method have been implemented to minimize the risk of false positives by integrating population structure and family relatedness within populations (YU et al. 2006). To aid in the interpretation of results, TASSEL allows for linkage disequilibrium statistics to be calculated and visualized graphically. Linkage disequilibrium is estimated by the standardized disequilibrium coefficient, D, as well as r 2 and P-values. Diversity analysis tools are also available, where diversity estimates include average pair-wise divergence (π) and segregating sites. Other notable features of TASSEL include a sequence alignment viewer, extraction of SNPs and indels (insertions & deletions) from alignments, a neighbor-joining cladogram, and a variety of data graphing functions. TASSEL can merge data from different sources into a single analysis data set, impute missing data using a k-nearest-neighbor algorithm (COVER and HART 1967), and conduct principal components analysis (PCA) to reduce a set of correlated phenotypes. Due to the increasing availability of open data sources, TASSEL utilizes a data browser from the Genomic Diversity and Phenotype Connection (GDPC) project (CASSTEVENS and BUCKLER 2004) to provide an interface to relational databases. As a result, TASSEL users can access many open data sources, including genomic diversity data such as SNPs, SSRs and sequence alignments, as well as phenotypic data collected from field, genetic or physiological experiments. By incorporating this middleware which functions by means of a common graphical interface, TASSEL users are freed from creating a SQL query for each new database. At the time of this writing GDPC is connected to Panzea, Gramene, Germinate, and GRIN. TASSEL is written in Java, thereby enabling its use with virtually any operating system. It can be installed very simply using Java Web Start technology, which automatically checks for the most recent version of TASSEL each time the application is instantiated. In addition, Java Web Start will ensure that the correct version of Java Runtime Environment is running, thus avoiding complicated installation and upgrade procedures. 1

8 1 Getting Start The TASSEL software package can be installed using several different methods. At the user level, Web Start or Stand-alone installation can be used. Alternatively, at the developer level, the open source code can be downloaded and used to build the software package. 1.1 Web Start TASSEL can be installed very simply using Java Web Start technology, which automatically checks for the most recent version of TASSEL each time the application is instantiated. In addition, Java Web Start will ensure that the correct version of Java Runtime Environment is running, thus avoiding complicated installation and upgrade procedures. To begin, Java Web Start (JWS) must first be installed (prior to the installation of TASSEL). JWS is included as part of Java Runtime Environment (JRE) 5.0 and above. After the successful installation of JWS, TASSEL can be launched by clicking the following link: If you will be using TASSEL frequently and would prefer to launch the application from your desktop rather than by revisiting the website, Java Web Start can be used to manually launch TASSEL each time and/or to create a shortcut. Access the Java Application Cache Viewer by going to Start > Settings > Control Panel > Java. From the General tab, click on Settings in the Temporary Internet Files section and then click on View Applications and the Java Application Cache Viewer will appear. (Another way of achieving this is by going to Start > Run and typing in javaws). The TASSEL icon should now be visible and can be used to launch the application. Shortcuts can be created from the menu of the Java Application Cache Viewer: Application > Install Shortcuts. 1.2 Stand-alone Downloading a stand-alone version is recommended for anyone who has a slow Internet connection. While Java Web Start is a very good way of deploying software, it does not ask the user before attempting to download updates. Thus, a slow Internet connection may start a download process that requires an unreasonable amount of time to complete. If you are not interested in disabling your network connection each time before starting TASSEL, we recommend downloading the stand-alone version which does not attempt to update the program. However, given that TASSEL is a Java application, a Java Runtime Environment (version or greater) is still required for installation. To run the stand-alone version, double-click on the JAR file (TASSEL<version #>.jar). Alternatively, at a command prompt (in Windows go to Start > Run and type in cmd or command, then type cd to access the directory containing the stand-alone jar file) type the following: java jar <name of tassel jar file> 1.3 Open source code Open source code for the TASSEL software package is available at: The package uses the standard PAL library ( the COLT library ( and jfreechart ( Database access is achieved by GDPC middleware ( 2

9 1.4 Menu The TASSEL menu has several features designed to save the user time and effort File Save Data Tree This feature allows you to save the entire contents of the Data Tree panel to a default location. This is helpful when the user does not wish to recreate a Data Tree panel that is already well populated with information the next time he/she initializes the program. To save a Data Tree, select File > Save Data Tree Open Data Tree To restore a Data Tree that was saved during a previous instantiation of the program, select File > Open Data Tree Save Data Tree As To save the contents of a Data Tree panel to a specific location or to give them a specific name, select File > Save Data Tree As. A dialog will then appear which allows for user input. NOTE: The information outlined above for saving a Data Tree is applicable to files that are, in general, version specific. A Data Tree saved using TASSEL version may reopen using version or it may not. We recommend that you include the TASSEL version number when naming your saved Data Tree file so that, if necessary, you can download a particular standalone version of TASSEL from the downloads archive at Open Multiple Alignment File This feature is used to open multiple sequence alignment flat files that are in FASTA format Save Selected Dataset As This feature allows the user to save a data set as a flat file in Phylip sequential format by selecting a sequence alignment Tools The tools menu contains contingency test and option to set preference. 3

10 Contingency Test This utility calculates a chi-square contingency test or Fisher exact test (when using only the 2 x 2 table of observations) using the same algorithm as is used in determining linkage disequilibrium Preferences The Quality Score Colors tab, found in the Preferences dialog box, allows the user to set cutoff values for visualizing quality score values on a sequence alignment or a set of called SNPs. 4

To set a desired threshold, simply adjust the slider on the left side of the dialog. Ns, - (dashes), and alignments without any quality score information have a default value of -1 (minus one). 1.

11 To set a desired threshold, simply adjust the slider on the left side of the dialog. Ns, - (dashes), and alignments without any quality score information have a default value of -1 (minus one). 1.5 Panels TASSEL is organized into four main panels: the Options panel, Data Tree panel, Report panel and Main panel. The Options panel runs across the top of the users viewing area. All method buttons are located within this panel. The Data Tree panel is located beneath the Options panel on the left side of the viewing area. This panel organizes retrieved data and results into a structured format. Dataset(s) displayed in the Data Tree panel must first be selected before a desired function or analysis can be performed. To select multiple datasets, press the CTRL key in combination with mouse selecting. The Report panel is located on the bottom left of the viewing area. It displays information about a selected dataset(s) from the Data Tree panel, such as the type of data and how it was created. The Main panel occupies the right side of the viewing area. It displays the content of a selected dataset(s) from the Data Tree panel. 5

Option panel Main panel Data tree panel Report panel When a sequence alignment is loaded into TASSEL, users have the option of several different views.

12 Option panel Main panel Data tree panel Report panel When a sequence alignment is loaded into TASSEL, users have the option of several different views. To select your preferred view, hold down the Control button while clicking on the Main Panel (or simply right-click if you are using a two-button mouse). This brings up a context-sensitive menu with the following options: Match: Displays only those bases that do not match the top row. All other bases appear as.. Color Based on Top Row: All bases are displayed, but those bases that differ from the top row will be red in color. Color Consensus: All indels and all sites of variation will be red in color. Color Based on Quality Scores: If the alignment or called SNPs has associated quality scores, bases will be colored according to the cutoffs set in Tools > Preferences > Quality Score Colors from the Menu. 1.6 Method Buttons 6

13 All method buttons are placed in options panel. The buttons on top row are the buttons for Mode Selectors, Delete, Progress Bar, Print, Save and help. The buttons on the bottom row depend the mode of chosen. By default, the buttons for Data Mode are displayed. To make something happen in TASSEL, data must be imported into data tree panel. Click data set in the Data Tree and then click on the buttons across the top to work with those data set. All of TASSEL s functions are accessed from the button toolbar which is found below the Menu and above the Main Panel and Data Tree. The button toolbar has three buttons which divide the applications behavior into three categories or modes: 1) Data: import and manipulate data sets 2) Analysis: linkage disequilibrium, diversity, selection, association, etc. 3) Results: visualization tools The various functions available of each mode are described in the next three main sections. 7

14 2 Data Mode Data mode serves the purposes of importing data and preliminary data manipulation. Data mode is the default mode when TASSEL starts. The Data mode can be switched back from other mode (Analysis and Result) by clicking the Data mode button. Tassel has two ways of importing data. One way is via GDPC to import data from databases. The other way is through flat files in format of SNP genotype, polymorphism genotype, phenotype (Trait), population structure and kinship. The preliminary data manipulations includes filtering data by site or taxa, joining data (union or intersection), and etc. 2.1 GDPC Genotype and phenotype data generated from numerous genomic research projects are still valuable resources for the public, even after results are published. Some of these data have been migrated to several databases and can be accessed using Genotype Data and Phenotype Connection (GDPC). GDPC is middleware that eliminates the need for end users of data to understand various database schemas and write raw SQL queries to extract data. Instead, the GDPC browser provides a single, easy-to-use interface which can extract genotype and phenotype data from a variety of sources (CASSTEVENS and BUCKLER 2004). Currently, GDPC has connections to the following databases: Gramene diversity for maize, wheat and rice (JAISWAL et al. 2002; WARE et al. 2002a; WARE et al. 2002b; YAMAZAKI and JAISWAL 2005). Panzea (CANARAN et al. 2006; DU et al. 2003; ZHAO et al. 2006). GRIN: Germinate: GDPC has been integrated into TASSEL. Operation of GDPC in TASSEL is very similar to the operation of GDPC alone. GDPC can be displayed in TASSEL by clicking on the GDPC button in Data mode. 8

15 An additional feature allows TASSEL users to load GDPC data into a TASSEL dataset. Data is available for import once the user has parameterized a query and data is visible in either the Genotypes or Phenotypes tab. To load data, activate either the Genotypes or Phenotypes tab (depending on the data you wish to import) and then click the Load button. For additional information about GDPC, please see File This function provides multiple options to Import files for genotypes, phenotype, populations structure and kinship. Several genotype formats are allowed for genotype data, including sequence, FASTA, polymorphism and annotated alignments. Phenotype and population structure can be imported as numerical trait or covariates. Kinship must be load as squared numerical matrix in order to be used by mixed model. TASSEL datasets generally consist of two parts: a header that defines data structure and a body containing the main data. The first row of header usually contains number of rows, number of columns (or number sequences) and number of rows that name the columns. Tabs used as the delimiters. Missing values are represented by -999 for numerical data, and N,?, or - for genotype data. Users can either specify the file type or use the Guess option to let the program to determine the file type through a dialog box as follows. 9

16 Here we use the Guess function to import all the files from the tutorial dataset. After highlighting all the file and clicking Open button, these files will be displayed on the data tree window. 10

2.2.1 Sequence alignment TASSEL accepts the following format for haploid sequence alignment file: TaxaNumber SiteNumber Taxa1Name Sequence Taxa2Name Sequence Note: TASSEL s treatment of taxon names

17 2.2.1 Sequence alignment TASSEL accepts the following format for haploid sequence alignment file: TaxaNumber SiteNumber Taxa1Name Sequence Taxa2Name Sequence Note: TASSEL s treatment of taxon names is case-sensitive. For example, b73 and B73 are treated as separate and distinct taxa. Here is an example: 6 30 A272.ID2 A554.ID4 A6.ID5 B103.ID8 AG--GGGGGG GGGGGGGGGG GGG-GGGGGG AGGGGGCG-G GGGGGGGGGG GG--GGGGGG GGGGGGCG-- GGGGGGGGGG G---GGGGGG GGGGGGCG-- -GGGGGGGGG GGG---GGGG 11

18 B14A.ID10?????????????????????????????? B73.ID13 G G TASSEL also accepts some NEXUS formats, so we encourage some trial and error by the user. Files can be interleaved or not and include spaces or not Polymorphism alignment To input multiple allele SSRs, or SNP data, a file must contain a heading with first row consists of three numbers. The first is the number of individuals or lines. The second is the number of loci. The last is the number of ploy (2 for diploid). The first two numbers are separated by a Tab. The last two numbers are separated by a colon. The second row of the heading contains names of loci. In situation of haploid, the ploidy and the colon in front of can cab be omitted. The body consists of one column for the identification of individual or line, followed by additional columns for genotypes over different loci. Columns in the body are separated by a Tab. For each locus, the home alleles are separated by a colon. A missing value is indicated by a question mark (?). Example 1: Haploid SSR data 99 6:1 S01 S02 S03 S04 S05 S06 TX TX TX ? TX Or 99 6 S01 S02 S03 S04 S05 S06 TX TX TX ? TX Example 2: Diploid SSR data 99 6:2 S01 S02 S03 S04 S05 S06 TX01 123: : : : : :228 TX02 121: : : : : :230 TX04 119: : : : :227?:? TX05 119: : : : : :228 Example 3: Diploid SNP data 12

19 99 6:02 S01 S02 S03 S04 S05 S06 TX01 T:T A:T T:T A:A A:T G:T TX02 T:T A:A A:A A:A T:T T:T TX03 A:A T:T T:T T:T A:A T:T Annotated alignment Here is an example: <Annotated> <Transposed> Yes <Taxa_Number> 5 <Locus_Number> 3 <Poly_Type> Catagorical <Delimited_Values> No <Taxon_Name> MO001 MO003 MO005 MO007 MO008 <Chromosome_Number> <Genetic_Position> <Locus_Name> <Value> 1 0 umc1354 BABAB umc1566 AABAB AY AABAB Data for the above annotated alignment would be displayed as follows: Numerical trait or covariate To input trait values or phenotypes by flat file, a text file can be used. Population structure information can be entered using the same format. TASSEL accepts the following format: TaxaNumber TraitNumber HeaderRows Header1.1 Header1.2 etc. (These are usually trait names) 13

20 Header2.1 Header2.2 etc. (These are usually environment names) Taxa1Name Value1 Value2 Taxa2Name Value1 Value2 Example 1: Three traits on 301 maize lines EarHT dpoll EarDia Example 2: Population structure of 286 maize lines Q1 Q2 Q3 STIFFSTALK NONSTIFF TROPICAL Missing values are indicated by Header rows are optional--if more than two are used, the additional headers will be ignored. Population structure information can be loaded from flat files in the same manner as trait files Square numerical matrix Kinship can either be calculated externally from pedigree by using SAS Procedure Inbreeding (SAS 2002.) or from markers by using software packages such as SPAGedi (HARDY and VEKEMANS 2002). If n represents the number of taxa, the format for kinship files is as follows: n Taxa1Name r11 r12 r1n Taxa2Name r21 r22 r2n TaxanName rn1 rn2 rnn Here rij (i, j=1,2,, n) is the element in the kinship matrix located at row i and column j. Missing values are not allowed for kinship matrix. 14

Important note: The current format in TASSEL 2.1 and after is different from the format in previous version. 2.3 Sites Subset data The alignment can be filtered in several ways.

21 Important note: The current format in TASSEL 2.1 and after is different from the format in previous version. 2.3 Sites Subset data The alignment can be filtered in several ways. Monomorphic sites can be eliminated, and various regions of a sequence can be eliminated. Minimum Count - the minimum number of taxa in which the site must have been scored to be included in the filtered dataset (GAP or missing data does not count). Minimum Frequency - the minimum frequency of the minority polymorphisms for the site to be included in the filtered data set. Start Position, End Position, Distance from End establishes the range of sites for filtering. Extract Indels - if selected, indels are extracted from the alignment. If not selected, only point substitutions are extracted. Remove minor SNP states converts tertiary and rarer states to missing data (? ), thereby forcing sites to have only two types of segregating sites at a locus. This can help remove bad sequence effects. 15

Generate haplotypes via sliding window creates haplotypes from an ordered set of SNPs. 2.4 Taxa Subset data: remove/select specific taxa from a data set.

22 Generate haplotypes via sliding window creates haplotypes from an ordered set of SNPs. 2.4 Taxa Subset data: remove/select specific taxa from a data set. This sub setting is achieved by selecting either genotypic, phenotypic, or population structure data from the data tree. The resulting dialog box displays the selected data in table format. By using either the CTRL or SHIFT key in conjunction with mouse clicking, the user can select or deselect taxa rows. Once desired taxa has been selected, the Capture Selection or Capture Unselected buttons will send the desired data back to the data tree as a new dataset. 2.5 Genotype Converter This button generates a data set that represents allele heterozygosity, including additive and dominance models. 16

23 Genetic analysis requires that genotype data be converted from allele format into formats more compatible with genetic analysis tools. One such format is genotype status. To convert genotype data to genotype status, highlight the genotype data you wish to convert and then click the Genotype button located in the Options panel. In the resulting dialog box, select the Create alignment based on genotypic state (eg. A:a > Aa) option and then click OK. 2.6 Transform, Standardize and Impute Data This suite of functions allows multiple data manipulation on genotype and phenotype (numerical) data. When a genotype date is selected, a numerical transformation is performed for genotype data. When a numerical data is selected, mathematical transformation, data imputation and principal component analysis (PCA) are performed. The Transform The columns tags will be displayed in a Data dialog box with three tabs: Trans, Impute and PCA Genotype numericalization Two options are provided to transform genotype from character to numerical as shown in the following dialog box. 17

Selecting the Standardize checkbox will transform data by subtracting the column mean from the value and then dividing by the column

24 2.6.2 Transform and/or Standardize Data The Trans dialog box is the default selection. It has the display as follows. In the Column list, select the column(s) you wish to transform. Then select the type of transformation you wish to execute. Selecting the Standardize checkbox will transform data by subtracting the column mean from the value and then dividing by the column s standard deviation. Clicking on the Create Dataset button will result in the placement of a data set containing only the selected columns in the Data Tree. 18

25 2.6.3 Imputing Missing Data The k-nearest-neighbor algorithm (COVER and HART 1967) is used to impute missing data. Click on the Impute tab to display the following: PCA Principal component analysis can only be performed on a numerical data without missing values. Two options are available: correlation or covarianc. 19

26 2.7 Synonymize Taxa Names This button makes taxa names uniform to permit the joining of data sets. Both the union and intersection joining functions that generate fused data sets work by matching taxa names. Consequently, if multiple names exist for a given taxa (an added suffix, alternative spellings, different naming conventions, etc.) then the two data sets will not join correctly. To help remedy this, the Synonymizer function allows the taxa names of one data set to replace the taxa names of the second data set. It relies on an algorithm that calculates the degree of similarity between names, using that name which is most similar to those in the first data set. When using the Synonymizer, keep in mind that order of selection matters. Always select the data set with the names you wish to use (the real name) first, and then, while holding down the CTRL key, click on the second data set with the taxa names you wish to change (the synonym ). Then click on the Synonymizer button. A synonym data set will be placed on the Data Tree panel under Synonyms. Each name in the data set selected second is now listed in the TaxaSynonym column. Next to this column is a TaxaRealName column listing the highest scoring match derived from the real name data set. The MatchScore column gives an indication of the amount of similarity between the two names (where 0 is no similarity and 1.0 is identity). 20

Caution! Before the synonyms are applied, we strongly encourage the user to check the match score, especially for those taxa with low match scores.

27 Caution! Before the synonyms are applied, we strongly encourage the user to check the match score, especially for those taxa with low match scores. Once it has been determined that these taxa were interpreted correctly, the synonyms can be applied as following: With the synonyms selected, hold down the CTRL key while clicking on the second/synonym data set (the data set whose names you would like to change). Then once again click on the Synonymizer button to apply the new names to the data set. In the event that some pf taxas are not interpreted satisfactorily, matches can be modified manually. Select the taxa you wish to modify on the left side, and then choose a replacement taxa from the right side. Click the arrow button to substitute the taxa. Click OK to save the change. 21

28 2.8 Union Join This button joins multiple datasets by a union of their taxa. Missing data will be inserted if taxa is missing from one data set. To achieve this function, select multiple datasets using the CTRL key in conjunction with mouse clicks, and then click on the union button to join the datasets. Because this function uses taxa names to join datasets, any variation in taxa names can prevent proper joining. Taxa names can be made uniform by the Synonymizer. 2.9 Intersection Join This button joins multiple datasets by the intersection of their taxa. Taxa must be present in both datasets to be included. 22

29 To achieve this function, select multiple datasets using the CTRL key in conjunction with mouse clicks, and then click on the intersection button to join the datasets. Because this function uses taxa names to join datasets, any variation in taxa names can prevent proper joining. Taxa names can be made uniform by the Synonymizer. 23

3 Analysis Mode Analysis mode consists of the following options: Each option is described in detail below: 3.1 Diversity This button executes a basic analysis of diversity.

30 3 Analysis Mode Analysis mode consists of the following options: Each option is described in detail below: 3.1 Diversity This button executes a basic analysis of diversity. Average pairwise divergence (π), segregating sites, and θ estimates (4Nμ) can be calculated, as well as sliding windows of diversity. To run a diversity analysis, click on a raw sequence alignment, and then select Analysis Diversity. In the resulting Diversity Surveys dialog box, the various site classes available for analysis are listed on the left. If the sequence has no annotation, then only the Overall and Indels options will be active. A sliding window of diversity can also be calculated across the region. To produce a sliding window, check the box next to Sliding Window, and then enter the desired step size and size of the sliding window. Results can be plotted using Results Chart or viewed in a table via Results Table. 24

31 3.2 Linkage Disequilibrium This button generates a linkage disequilibrium data set from SNP data. NOTE: It is important to use only filtered datasets (apply Data Sites first) when estimating linkage disequilibrium, as a raw alignment with numerous invariant bases will take a very long time and consume a large amount of memory to calculate. Linkage disequilibrium between any set of polymorphisms can be estimated by clicking on a filtered set of polymorphisms and then using Analysis Link. Diseq. At this time, D', r2 and P-values will be estimated. The current version calculates LD between haplotypes with known phase only (genotypes are not supported; see PowerMarker or Arlequin for genotype support). D' is the standardized disequilibrium coefficient, a useful statistic for determining whether recombination or homoplasy has occurred between a pair of alleles. r 2 represents the correlation between alleles at two loci, which is informative for evaluating the resolution of association approaches. D' and r2 can be calculated (Weir 1996) when only two alleles are present. If multiple alleles are present, a weighted average of D' or r2 is calculated between the two loci (Farnir 2000). This weighted average is determined by calculating D' or r2 for all possible combinations of alleles, and then weighting them according to the allele's frequency. Note: It is not entirely certain that this procedure fully accounts for allele number effects. P-values are determined by two methods. If only two alleles are present at both loci, then a two-sided Fisher's Exact test is calculated. NOTE: Previous editions of TASSEL used a one-sided test, but TASSEL version and later use a two-sided test. If more than two alleles are present, permutations are used to calculate the proportion of permuted gamete distributions that are less probable then the observed gamete distribution under the null hypothesis of independence (Weir 1996). When calculating linkage disequilibrium, users have the option of employing Rapid Permutations. If this option is selected, the algorithm will compute either a fixed number of permutations or run until 10 permutations are found that are more significant than the observed P-value. While this slightly reduces P- values, it also saves a large amount of computational time. If an unbiased p-value is desired, then the user must unselect the Rapid Permutations check box. 25

Linkage disequilibrium results can be plotted using Results LD Plot or viewed in a table via (Results Table). Farnir, F., Coppieters, W., Arranz, J.-J., Berzi, P., Cambisano, N., Grisart, B.

32 Linkage disequilibrium results can be plotted using Results LD Plot or viewed in a table via (Results Table). Farnir, F., Coppieters, W., Arranz, J.-J., Berzi, P., Cambisano, N., Grisart, B., Karim, L., Marcq, F., Moreau, L., Mni, M., Nezer, C., Simon, P., Vanmanshoven, P., Wagenaar, D. & Georges, M. (2000). Extensive Genome-wide Linkage Disequilibrium in Cattle. Genome Res 10, Cladogram This function generates a tree or cladogram data set. TASSEL produces neighbor-joining trees using only simple parsimony substitution models. To retrieve cladogram data, first select genotypic data from the Data Tree panel and then click on the Analysis button, followed by the Cladogram button. The resulting tree data and the corresponding matrix will appear as separate data sets on the Data Tree panel. Results can be plotted using Results Tree Plot. 3.4 Kinship Matrix The function generates a kinship matrix from a set of random SNPs. To do so, first highlight SNP data then click on the Analysis button, followed by the Kin button. The resulting kinship data will be added as a data set on the Data Tree panel. 26

When a genotype file is selected, the kinship matrix is generated by converting the distance matrix calculated from TASSEL s Cladogram function to a similarity matrix by subtracting all values from 2.

33 When a genotype file is selected, the kinship matrix is generated by converting the distance matrix calculated from TASSEL s Cladogram function to a similarity matrix by subtracting all values from 2. Kinship can be derived from a set of random SNP data (a minimum of several hundred SNPs spread over the whole genome is recommended). Users may also load their own kinship data using Data Kin. Kinship matrices can be calculated using the SPAGeDi sortware package ( A kinship matrix can also be calculated based on percent shared alleles (Stich et al. 2008) 3.5 General Linear Model This function performs association analysis by a least squares fixed effects linear model. TASSEL utilizes a fixed effects linear model built by the user to test for association between segregating sites and phenotypes, while also accounting for population structure. To establish the model, the user must first define the input data types so that the software knows how to treat each input variable. In the Input Data Definition window, users will find two columns marked Input and Type. Each row in the Input column represents data from the phenotypic data set. A corresponding drop down box in the Type column allows users to indicate whether that row contains data, a covariate, or a factor (classification variable), or whether it should be excluded from the analysis. It is important that each variable be correctly identified in order to obtain valid results. A separate analysis will be performed for each trait. If the box is checked for Analyze each column of data separately, then each column will be analyzed independently of the others. For example, if the same trait is analyzed in three separate environments, each environment will be analyzed separately rather than combined into a single analysis. Population structure can be incorporated into the model by using covariates that indicate percent contribution to each taxon by specific populations. These must be identified as covariates not as data. If population structure percents sum to 100% for each taxon, one of the populations should be excluded from the analysis in order to obtain valid F-tests of the population 27

covariates. If the covariates are linearly dependent, then the F-test for each of them will be zero because the F-test for each covariate is calculated after fitting all the others.

34 covariates. If the covariates are linearly dependent, then the F-test for each of them will be zero because the F-test for each covariate is calculated after fitting all the others. Each item in the Data Headers section represents a header in the input file. Data imported from GDPC will have only two header rows: trait and environment. Data imported from flat files can contain additional headers. These additional headers may indicate additional levels of nesting, such as locations and reps within locations. The first header row is assumed to be trait and the second is assigned a label of env (for environment), which can be changed by the user. The user must type in a label for any additional headers or they will not be utilized in the analysis. Pressing OK on the Input Data Definition window opens the GLM Dialog. This dialog box allows the user to select the effects to be used in the model: To specify main effects, select one or more input fields and then click the > button. To specify an interaction effect, select the factors involved in the Available Factors list then click the * button. Hold down CTRL to make multiple selections. To specify a nested effect, include it as part of an interaction but not as a separate main effect. To remove an effect from the model, click <. To add or remove all traits click >> or << The next step will be either to run the model or to modify the default output to be generated by the analysis. Clicking the button labeled Define Output displays the F-tests that will be included in the output as well as any effect estimates to be generated. F-tests are defined by selecting one factor for the hypothesis and one for the error term, then clicking Add F-Test. A permutation test for the significance of an F-test will be calculated if the user enters the number of permutations in the Ftest table. A 28

permutation test can only be requested when the residual is used as the error term. Additionally, means will be calculated for a selected hypothesis by clicking the add LSMean button.

35 permutation test can only be requested when the residual is used as the error term. Additionally, means will be calculated for a selected hypothesis by clicking the add LSMean button. When data for taxa are replicated, the proper error term for testing markers is not the residual but taxa nested within markers. This term can be specified in the model by including marker:taxa but not including taxa as a main effect. Alternatively, the analysis can be run using the taxa means rather than the replicated data. In fact, taxa means must be used in order to specify a permutation test because permutations will not be run if the error term is marker:taxa. The reason is that if permutations are run using anything except the model residual as the error term, the analysis takes much longer. Calculating the taxa means then using that data set for the permutation test of markers gives approximately the same results and saves a lot of computation time. Means for taxa can be calculated by selecting the phenotype data set and using GLM to generate a taxa means data set, which can then be combined with genotype and population data as usual. In addition, GLM can estimate effect means using a Fusion data set to provide greater flexibility in subsetting the data. When data is replicated across environments, a good scenario to follow is to use GLM without permutations to test markers and marker:environment interactions. When marker:environment interaction is significant, the user should consider analyzing each environment separately. Taxa means can then be 29

36 used to test markers across locations if a permutation test is desired. In the presence of marker:environment interaction, care should be taken in the interpretation of tests of markers across locations. Taxa means should be calculated without population covariates or markers in the model. Population covariates could then be included in the subsequent analysis of marker association based on the taxa means. Background: The association test computes a least squares solution (SEARLE 1987) to the fixed effects linear model constructed in the GLM dialog. The software uses a Σ-restricted model, which leads to a marker SS that is a good test of marker effects even when data is unbalanced. Data is unbalanced when there are an unequal number of observations for marker states or for other main effect levels. For each effect in the model, the sum of squares is computed after fitting all other model effects. In the case of missing data (no values for some marker by main effects combinations), the resulting F-test for markers is not exact in the presence of significant interaction effects and care must be taken in interpreting the results. In general, one must exercise care in interpreting tests of main effects when interactions are significant. The primary purpose of the permutation test (CHURCHILL 1994) is to provide a test of significance that corresponds to the experiment-wise error in order to correct for the fact that multiple comparisons are being made. The output lists both a site-wise and an experiment-wise permutation test. After each permutation, the p-values for each site are recalculated. For the site-wise value, the new p-value for each site is compared to the original p-value for that site, which was calculated from the unpermuted data. For the experiment-wise value, the minimum p-value across all hypotheses from that permutation is compared to the original p-value for each site. The value reported in the output table is the percentage of instances in which the permutation p-value is lower than the original p-value. The site-wise value should be very close to the p-value from the original F-test for normally distributed data. The experiment-wise value reflects the experiment-wise error rate, correcting for multiple comparisons, and is the best value to use in making decisions about significance of marker effects. Churchill and Doerge recommend using at least 1000 permutations for a.05 level of significance. Permutations are carried out for simple models containing only a single main effect by randomly shuffling the data across all observational units. For more complex models, the residuals from the reduced model are shuffled and added back to the individual observation estimates. The reduced model is the model which contains all the effects except the one being tested (Anderson and Ter Braak 2003). Description of Output: The following table shows an example of an output table from a simple analysis: In addition to displaying the F-statistics and p-values for the requested F-tests, the table also contains information about degrees of freedom, the error mean square for the model, R-square of the model, and R- square for the marker. The model R-square is the portion of total variation explained by the full model. 30

37 The marker R-square is the portion of total variation explained by the marker but not by the other terms in the model. When permutations are requested, #perm_marker is the number of permutations run, p- perm_marker is a test of individual markers, and p-adj_marker is the marker p-value adjusted for multiple tests. The p-adj_marker value is a permutation test derived using a step-down MinP procedure (Ge et al. 2003) and controls the family-wise error rate (FWER). For example, if only markers with p-adj values of.05 or less are accepted as significant, then the probability of rejecting a single true null hypothesis across the entire set of hypotheses is held to.05 or less. This test takes dependence between hypotheses into account and does not assume that hypotheses are independent as do other multiple test correction procedures. 3.6 Mixed Linear Model This conducts association analysis via a mixed linear model (MLM). Background: Complex traits are usually controlled by multiple QTL. The primary goal of QTL mapping is to find associated markers for each QTL. One common approach of association studies is to fit markers into a linear model as fixed effects. Any QTL not linked to the markers contribute to the residual error, thus inflating the error term and reducing statistical power. Further, if the remaining QTL are not randomly distributed across the tested markers, they can cause spurious associations between phenotypes and the tested markers. These false associations can be partially corrected with a structured association method that uses a Q-matrix of population membership estimates, and further reduced by incorporating multiple background QTLs as a random effect in a mixed model based on the premise that the genome of each individual or line is a sample of its population s gene pool. The average relationship between individuals or lines can be estimated by kinship (K) calculated from either pedigree or a suitable amount of random markers across the entire genome. The Q + K method, which combines information from both Q and K, was shown to be superior to more conventional linear models in association analyses (YU et al. 2006). The Q + K method has been implemented in TASSEL as a MLM (mixed linear model) function. The statistical model can be described in Henderson s notation (HENDERSON 1975) as follows: y = Xβ + Zu + e where y is the vector of observations; β is an unknown vector containing fixed effects, including genetic marker and population structure (Q); u is an unknown vector of random additive genetic effects from multiple background QTL for individuals/lines; X and Z are the known design matrices; and e is the unobserved vector of random residual. The u and e vectors are assumed to be normally distributed with null mean and variance of Var u = G 0 e 0 R where G = σ 2 a K with σ 2 a as the additive genetic variance and K as the kinship matrix. Homogeneous variance was assumed for residual effect which means R=I σ 2 e, where σ 2 e is residual variance. The REML estimate of σ 2 a and σ 2 e were obtained through the expectation and maximization algorithm (LAIRD and WARE 1982). Disadvantages of using a mixed linear model The number of equations to be solved in a mixed linear model is usually two or more orders of magnitude greater than with a GLM. In addition, the solution to each of these equations is achieved by means of iteration. Consequently, a mixed linear model requires much more computational time. 31

38 MLM provides the user with a few different options for solving mixed models. These options are available on the Heritability dialog box. The user can choose a method named P3D to estimate heritability, can estimate the variance components and heritability separately for each marker, or can use a previously determined heritability. P3D uses the mixed model specified by the user but without a marker effect to estimate the variance components, σ 2 a and σ 2 g. Tests of significance for each marker are calculated using those initial estimates. When a heritability value is supplied by the user, the variance components can be calculated directly from the heritability without iteration. In addition, the user can specify an analysis method. The available methods are EM and EMMA. The EM method uses an expectation-maximization algorithm to derive a restricted maximum likelihood (REML) estimate of the variance components. The EMMA method uses a Java implementation of the efficient mixed-model association algorithm (KANG et al. 2008) Using the P3D method or using a prior heritability can result in a substantial time savings with the EM method compared to calculating heritability for each marker. The same two methods do not provide much time savings when using the EMMA method. In the Parameters dialog that follows this screen, the EMMA method does not use the Aitken Step Size parameter or maximum number of iterations. 32

39 3.7 Structured Association Test Association analysis by logistic regression. This function will carry out associations between segregating sites and phenotypes, while accounting for population structure. No experiment-wise error correction is provided. Because biallelic genotypic data is used, any rare, multi-allelic states will be fused. A logistic regression ratio test is used to evaluate associations involving quantitative traits while controlling for population structure (Thornsberry, 2001). In the null hypothesis H0, candidate polymorphisms are independent of phenotype, while in the alternative hypothesis H1, candidate polymorphisms are associated with the phenotype. The probability of each hypothesis is compared in the following way: Pr1 ( C; T, Qˆ) Λ = Pr ( C; Qˆ) 0 where C is the genotype of the candidate polymorphism for all lines, and T is the trait value for all lines. In this test, the difference in the natural logarithm likelihoods of the model with (Pr1) and without (Pr0) the trait is the test statistic Λ. Since the distribution of Λ is not known precisely, permutations should be used to determine significance. If several sites with high LD are being scored, then the maximum Λ over all sites is used as the test statistic Λmax. Permutations are calculated based on this Λmax statistic. Permutations to determine significance: These statistical tests will result in a p-value associated with each polymorphism-trait pair. However, for many association tests there will be tens or hundreds of polymorphisms to test. The normal modification for multiple tests would be a Bonferroni correction, but this is far too conservative for highly correlated polymorphisms. The goal of permutations is to determine the number of independent tests, which is confounded by LD, and account for non-normality in trait distributions. The trait values should be permuted relative to the fixed haplotypes (CHURCHILL 1994), and then associations recalculated for 100 to 1000 permutations. (PRITCHARD et al. 2000) suggests permutations based on population structure, and this approach is implemented by TASSEL. To run a structured association test, select a dataset that contains phenotype, polymorphisms, and population structure information and then click on Analysis Struct. Assoc. to open the menu below. Use the arrow keys to move estimates of population structure into the Population Structure Estimates list. To ensure a quality significance estimate, make sure the permutation box has been checked and enter an appropriate number of permutations. 33

Two different sets of results sets are produced from this analysis. The first, LogRat_Perm_Trait_Gene, provides summary statistics for the locus.

Users are encouraged to consider two key columns: (1) DiffLL the difference in log likelihoods for the two models above (H0 and H1), and (2) PermutedP the p-value based on the permutations described

40 Two different sets of results sets are produced from this analysis. The first, LogRat_Perm_Trait_Gene, provides summary statistics for the locus. These results can be explored using a table or a 2-D Plot. The second result set is LogRat_Trait_Gene. This provides results for individual sites by trait and environment. Users are encouraged to consider two key columns: (1) DiffLL the difference in log likelihoods for the two models above (H0 and H1), and (2) PermutedP the p-value based on the permutations described above. The remaining statistics describe the fit of each model for that specific site. The first line of this table provides information on the very best site-trait-environment combination (NOTE: This may be eliminated in future versions of TASSEL, as this information is captured in the LogRat_Perm_Trait_Gene result table). The permuted p-value for this first site is the locus-wise p-value, equivalent to Site_Statistic_Over_Environment_And_Trait described above (this accounts for multiple testing issues). The remaining p-values are for a specific site, and DO NOT ACCOUNT FOR MULTIPLE TEST ISSUES. 34

3.8 Manual SNP Extractor Manually extracts a SNP from raw gene sequences. The extracted SNP is added under Results SNP Data in the Data Tree panel.

41 3.8 Manual SNP Extractor Manually extracts a SNP from raw gene sequences. The extracted SNP is added under Results SNP Data in the Data Tree panel. To append additional SNPs to the results, hold down the CTRL key while selecting both the raw gene sequence and the extracted SNP data set to which you wish to add. Then click SNP Extr. In the SNP Extractor Dialog, enter the base number in What SNP? For example, if a polymorphism exists at base 95, then type in 95. Continuing this example, the Surrounding Bases feature works as follows: if the user types a 5 in the box, then the SNP will be extracted along with bases 90 to 100. Two format options are available for output: Format 1: GGTAT Y TCTAC Format 2: GGTAT[C/T]TCTAC The resulting data can be viewed or exported using a table. 3.9 Bulk SNP Extraction Extracts SNPs from a raw sequence alignment into a useful format for export. Additionally, this function provides information for designing genotyping assays. Below is a detailed explanation of the SNP Extractor Dialog: 35

Minimum Site Frequency: the minimum frequency for which the site must have a good base Minimum SNP Frequency: the minimum frequency of the minority polymorphisms for the site to be included in the

42 Minimum Site Frequency: the minimum frequency for which the site must have a good base Minimum SNP Frequency: the minimum frequency of the minority polymorphisms for the site to be included in the resulting data set Minimum Surrounding Bases: the minimum number of good bases on at least one side of the SNP Minimum Good SBE Bases: the minimum number of good bases on at least one side of SNP Filter SNPs to Biallelic: converts tertiary and rarer states to missing data (? ), thereby forcing sites to have only two types of segregating sites at any particular locus. This helps to remove bad sequence effects. Results are displayed on the Data Tree panel and include SNPs along with their context in two different formats (IUPAC-IUB ambiguous nucleotide and another see Appendix for more information). Additional information is also provided, including: the location of the nearest polymorphisms on either side, polymorphism information content ( PIC ), and Haplotype PIC. Overall score is essentially an estimate of the ability to design a single-base pair extension reaction in the region. These results can be exported by using a table (Results Table). 36

4 Results Mode Results mode consists of the following options: 4.1 Table Allows data to be displayed in a spreadsheet view and exported into a flat file.

Shown below is an example in which diversity estimates are displayed. Data can be sorted by clicking on the column header of interest.

43 4 Results Mode Results mode consists of the following options: 4.1 Table Allows data to be displayed in a spreadsheet view and exported into a flat file. To create a table, select a data set from the Data Tree panel, then click on the Results button followed by the Table button (Results Table). Shown below is an example in which diversity estimates are displayed. Data can be sorted by clicking on the column header of interest. A secondary sort can be done by holding down the CTRL key and clicking on a second column. Data can be exported to flat files that are either comma-separated (Comma Separated Values = CSV) or tab-delimited. Both these formats can then be imported into a spreadsheet program such as Excel. Tables can also be printed. 37

4.2 LD Plot Displays the results of linkage disequilibrium analysis. After selecting the desired result from the Data Tree panel, click on the LD Plot button while in Results mode (Results LD Plot).

44 4.2 LD Plot Displays the results of linkage disequilibrium analysis. After selecting the desired result from the Data Tree panel, click on the LD Plot button while in Results mode (Results LD Plot). The graph that is generated displays LD between all possible pairs of sites. The black diagonal represents LD between each site and itself. The default setting graphs r 2 in the upper right and p-values in the lower left. This default can be modified by clicking on the radio buttons in the lower left. The left side of the graph contains a text description of the gene (or chromosome) and the site within the gene (or genetic position within the chromosome). At the bottom of the graph is a display of the position of each site along the gene or chromosome. This display can be hidden by deselecting the Schematic checkbox. Legends describing the color scheme appear on the right hand side of the graph. LD plots can be printed, saved in JPEG format, or saved as a Scalable Vector Graphics (SVG) file. An SVG file is useful for creating publication quality graphics which can be easily sized using an editor such as Adobe Illustrator, Corel Draw, and OpenOffice.org Draw

45 4.3 Tree Plot Displays the results of cladogram analysis. After running Analysis Cladogram, select the desired dataset and then click Tree Plot in the Results mode (Results Tree Plot). Trees can be visualized in either a Normal or Circular layout. These image can be printed, saved in JPEG format, or saved as a Scalable Vector Graphics (SVG) file. 4.4 Chart Method for visualizing numeric data. This feature can be used to display histograms, XY plots, bar charts and/or pie charts. Any numeric table data can be charted, including LD results, phenotypic data, diversity results, and association results. Histograms: Use the graph type combo box to select the desired graph type (Histogram) from the list of options. Up to two different series of data can be plotted together. Users may specify the number of bins to be used in the histogram. 39

Select data to be plotted in X and Y axes using the appropriate drop down

46 Scatter plots: Use the graph type combo box to select the desired graph type (XY Plot) from the list of options. Select data to be plotted in X and Y axes using the appropriate drop down boxes. If two data series are plotted simultaneously on the Y axis, the 2 Y Axes checkbox will provide an axis for each. 40

Using the drop down boxes provided, populate rows with Environment, columns with Site, and value with PermuteP.

47 4.5 2D Plot Displays 2D plots and determines color thresholds. This function is useful for plotting associations in multiple environments. Note: This view is limited to Struct Assoc results, as neither GLM nor MLM report results by environment. First select the desired result set from a Structured Association analysis. Using the drop down boxes provided, populate rows with Environment, columns with Site, and value with PermuteP. The cutoff value for coloring can be chosen either by inputting a value in the text box or by using the slider tool to the right of the text box. Users can mouse over any box to view the value associated with that box, as shown here: If P-value coloring is desired, simply check the P-value box as shown below: By checking the P-value box, Cutoff selection tools will be disabled and fields will instead be colored according to the following grayscale: 41

For ideas and/or programming collaboration, TASSEL source code is available on SourceForge. 4.7 Visual Genotype Sorts and displays genotypes using visual color coding from SNP data.

48 This key can be shown by clicking on the? icon next to the P-value check box. 4.6 Gene Plot Allows for gene annotation to be displayed visually. Note: Although this feature is designed for imported data sets, TASSEL does not currently support a method for importing annotated data sets. For ideas and/or programming collaboration, TASSEL source code is available on SourceForge. 4.7 Visual Genotype Sorts and displays genotypes using visual color coding from SNP data. This feature provides a visual abstraction of a genotype in which each SNP is represented by a colored box, thus simplifying identification of major and minor variants. To create a visual genotype, first select a SNP dataset (as generated from a sequence alignment by Data Sites in Section 4.1.7). Then click on VGType (in Results mode). Users can double-click on a box to move that particular taxa to the top row. All rows that share the same variant will then be grouped immediately below it. Another useful feature allows users to pause the mouse icon over any box to reveal specific taxa/sequence information for that particular box (see image below). 42

Checking the Color by Reference box will make the selected taxa (the top row) a single color and will then sort and color SNPs of the remaining taxa relative to the top row.

49 Checking the Color by Reference box will make the selected taxa (the top row) a single color and will then sort and color SNPs of the remaining taxa relative to the top row. Thus, taxa in the bottom rows share the least in common with the selected taxa/top row. As seen above, double-clicking on a box that is already positioned in the top row will make the actual SNP variant visible (verses re-ordering the entire grid). The Refresh button will refresh the view after selecting/deselecting Color by Reference and will also resort taxa by alphabetical order. This graphic can be exported as an SVG file. 4.8 Haplotype Viewer Creates a haplotype network view using SNP or sequence data. This feature allows viewers to create a minimum spanning tree that depicts the relationships (lines) between distinct entities (nodes or boxes). Each line between nodes represents the number of changes between entities. Each node contains the number of taxa that share a given gene variant. To access this feature, first select a raw gene sequence alignment or SNP dataset (as generated from a sequence alignment by Data Sites in Section 4.1.7). While in Results mode, click on HapView. If the resulting network tends to migrate across the screen over time, click the Freeze button to force the network to be static. By clicking and dragging, nodes can be moved to more suitable locations. The Blank button toggles the node numbers. As seen below, users can view the gene variant as well as all taxa that share it by clicking on a node while holding down the CTRL key. 43

50 44

5 Tutorial This tutorial reviews several common scenarios for the use of TASSEL, helping the user better understand its capabilities for data manipulation and association analyses.

This tutorial dataset contains real, raw data on phenotype, genotype, population structure and kinship.

51 5 Tutorial This tutorial reviews several common scenarios for the use of TASSEL, helping the user better understand its capabilities for data manipulation and association analyses. The TASSEL software package includes a tutorial dataset that can be downloaded from the TASSEL website (please unzip all files to a directory of your choice). This tutorial dataset contains real, raw data on phenotype, genotype, population structure and kinship. The results generated by TASSEL are included, as well as results generated by SAS procedure Mixed so that users can compare output for association analysis via a mixed linear model. File details are described in the Appendix. The Tassel 2.1 has user interface as shown 5.1 Importing Data from flat files Flat files are the most common type used for importing data into TASSEL. Flat files should be in the form of Tab-delimited text files. Files can be prepared using the instructions provided earlier in this manual or by following the examples in the TASSEL tutorial dataset. On Data panel, click File button to display File Loader dialog box as follows. At this time we use the default (Guess) function to load following files from the tutorial dataset. 45

52 These files will be displayed in the following sections SNP data 46

53 5.1.2 Phenotype data Population structure data Kinship data 47

5.1.5 Annotated Alignment This format can not be Guess correctly. You have to selection the Load annotated alignment (custom) radio option to load the demonstration file: TestAnnotatedAlignment.txt.

54 5.1.5 Annotated Alignment This format can not be Guess correctly. You have to selection the Load annotated alignment (custom) radio option to load the demonstration file: TestAnnotatedAlignment.txt. The file is displayed in TASSEL as follows. 5.2 Importing data from a database (via GDPC) GDPC, middleware that is integrated into TASSEL, allows the user to import data from a database. To display GDPC in TASSEL, click on the GDPC button in Data mode. General rules for working with databases include: 1. First establish a connection with the database. 2. Then define a query. 3. Finally, once the desired data is in GDPC, load the data from GDPC into TASSEL Connecting with a database To establish a connection with a database, click the Add Conn button followed by the button of the database you wish to add. Then click Ok. In the example below, we chose Panzea. 48

2 Data query GDPC is equipped with several tabs to query data, namely Taxa, Taxon Parents, Loci, Genotype Experiments, Environment Experiment, and Localities.

55 The information about Panzea is displayed as follows: To connect to more than one database, simply repeat the process outlined above. In the following figures, only the GDPC area will be displayed if other areas are deemed irrelevant Data query GDPC is equipped with several tabs to query data, namely Taxa, Taxon Parents, Loci, Genotype Experiments, Environment Experiment, and Localities. Within each tab, any retrieved data will be displayed in the Filtered List. Choose interested attributes by checking the desired boxes (located beneath the Filtered List). These selections subsequently reveals the available values by which to filter data. Here, using the Taxa tab, we chose Germplasm type (field) and then selected Inbred (value for filtering). After clicking the Get Data button, the subset of taxa from the database that meets these criteria are now included in both the Filtered List and the Working List. 49

The button will now appear as. This activates the Add selected items, Add all items, Remove and Remove all buttons.

56 Items listed in the Working List can be modified by the user. To do so, first break the link between the Filtered List and the Working List by clicking on the Link/Unlink button. The button will now appear as. This activates the Add selected items, Add all items, Remove and Remove all buttons. Here we first removed all items from the working list, then selected items with a name starting with the letter D. Clicking on the Add selected items button moved them to the Working List. The Working List is shown as follows. 50

To filter data by polymorphism type, first click on the Genotype Experiments tab, check the Polymorphism Type and Producer checkbox (field), and then select SNP and Jim Holland (value for filtering).

Results for this examples are shown below: Now genotype data can be extracted from the database by clicking on the Genotypes tab, followed by the Get Data button.

57 To filter data by polymorphism type, first click on the Genotype Experiments tab, check the Polymorphism Type and Producer checkbox (field), and then select SNP and Jim Holland (value for filtering). Finally, click the Get Data button to reveal the subset of data that meets these criteria. Results for this examples are shown below: Now genotype data can be extracted from the database by clicking on the Genotypes tab, followed by the Get Data button. After a moment, genotype data will be displayed as follows: Users can either save this genotype data in several forms or upload it to TASSEL. However, before outlining these procedures, let us finish the query by exploring phenotypes. Supposing the user would like to acquire data from experiments conducted in 2000, we first selected the Environment Experiments tab, followed by the Repetition checkbox. We then selected the desired repetitions in 2000 as the values 51

Here we chose Days to Silk listed in the Ontology field. Make sure no Taxa are selected and all Environment Experiments are selected that were retrieved in the previous step.

58 to be used for filtering, and clicked the Get Data button. The subset of data that meets these criteria is returned as follows: Now the user can extract phenotype data by first clicking on the Phenotypes tab. Traits can only be extracted one at a time. Here we chose Days to Silk listed in the Ontology field. Make sure no Taxa are selected and all Environment Experiments are selected that were retrieved in the previous step. Then click the Get Data button. Then click the Merge button, leaving only Accession checked under the Taxa Properties section. Leave Locality and Repetition checked under the Environment Experiments Properties section. Data are merged as follows: Importing GDPC data into TASSEL Genotype and phenotype data must be loaded in separate steps. To load genotype data, first click on GDPC in Data mode. Then click on the Genotypes tab, followed by the Load button. The genotype data 52

Results will look as follows: To load phenotype data from GDPC into TASSEL, first click on the GDPC button in Data mode.

59 is then loaded into TASSEL and labeled as Genotype. To view the uploaded data, click on Genotype within the Data folder in TASSEL. Results will look as follows: To load phenotype data from GDPC into TASSEL, first click on the GDPC button in Data mode. Then choose the Phenotypes tab, followed by the Load button. The phenotype data is then loaded into TASSEL and labeled as 4 traits/environ. To view the uploaded data, select 4 traits/environ from the Phenotypes folder in TASSEL. Results will appear as follows: 53

5.2.4 Saving GDPC query results All query results, including both genotype and phenotype queries, can be saved as either Tab-delimited text files or XML files.

60 5.2.4 Saving GDPC query results All query results, including both genotype and phenotype queries, can be saved as either Tab-delimited text files or XML files. Results are exported as tab-delimited text files by first choosing the Query Tab and then clicking on the Export button, or by clicking the Save As button to save results in XML format. Location and file name must be specified in both situations. Data in XML format can be imported back into GDPC by clicking on the Open button. 5.3 Importing/Exporting data to/from a data tree The TASSEL tutorial data set includes a file named demodata.zip. This file is a binary file that stores a loaded data tree for each type of data (Gene, Polymorphism, Phenotype, Population Structure and Kinship). To import the saved data tree, first click on the File menu, and then click Open to locate the demodata.zip file. Finally, click the OK button to import the data tree. Warning: The current data tree will be replaced with the imported data tree. To save a current data tree, use the File menu to select Save data tree as and then specify a location and name for the saved tree. 5.4 Data manipulation Filtering data While in Data mode, select the data set dwarf8 and then click on the Sites button. A Filter Alignment Dialog will open. Enter the following data by typing in the text boxes: Minimum count of 69 and Minimum freq of 0.05 (5%). Then select Remove minor SNP states. 54

Now click the Filter button. This will add a file named Point in the Data Tree panel. 5.4.

For this tutorial,select three_traits_rn_rename and popstructure_taxa286_rn_rename and then click on the U

61 Now click the Filter button. This will add a file named Point in the Data Tree panel Joining datasets In Data mode, select the datasets you wish to join while holding down the CTRL key. For this tutorial,select three_traits_rn_rename and popstructure_taxa286_rn_rename and then click on the U Join button. This adds the dataset three_traits_rn_rename + popstructure_taxa286_rn_rename to the Data Tree panel, as shown below. 55

5.4.3 Missing value imputation The phenotype file three_traits.txt will be used to demonstrate the process of imputing missing data. Note that the dataset below contains missing values (NaN).

62 5.4.3 Missing value imputation The phenotype file three_traits.txt will be used to demonstrate the process of imputing missing data. Note that the dataset below contains missing values (NaN). To impute missing data, first select the 3 traits/environ dataset in the Data Tree panel and then click the Transform button (Data Transform). The Transform Column Data window will open. Click on the Impute tab in this window. Finally, click on the Create Dataset button to add the new dataset with missing values imputed to the data tree. Note that missing values are now filled. 56

Here we use the dataset from three_traits_rn_rename_3_imputed file with missing values imputed (three_traits_rn_rename).

63 5.4.4 Principal component analysis (PCA) Because PCA requires no missing values, data must be thoroughly checked and any missing values imputed. Here we use the dataset from three_traits_rn_rename_3_imputed file with missing values imputed (three_traits_rn_rename). First select the desired dataset from the Data Tree panel and then click on the Transform button (Data Transform). When the Transform Column Data window appears, click on the PCA tab. The window will now look as follows: 57

64 Clicking on the Create Dataset button will add three datasets to the Data Tree panel, one each for principal components, eigenvectors and eigenvalues. 58

5.5 Using principal components as structure It was shown that principal components had the equivalent effect as population structure estimated through the maximum likelihood approach (Zhao et al.

65 5.5 Using principal components as structure It was shown that principal components had the equivalent effect as population structure estimated through the maximum likelihood approach (Zhao et al. 2007). Principal components can be quickly derived from a set of random markers. Here we use the 553 random marker (Yu et al 2006) for demonstration. First we convert the character data (in format of ATGCN) to numerical data by extracting the major allele and collapsing the non major alleles. Highlight the data and click Transform button to bring next windows: 59

To use the complete data, we impute the missing data before the principal component analysis.

66 After click the Create Dataset button, a new dataset named d8coding_rn_rename_collasped was created. Please notice that a NaN value was filled for missing SNP data. To use the complete data, we impute the missing data before the principal component analysis. Therefore, select the data set of d8coding_rn_rename_collasped, click Transform button to bring next windows: Chose the Impute panel and click Create Dataset button to fill in the missing data as following: 60

We choose two principal components as following: After the Create Dataset button is clicked, the PAC result is

67 Now we are ready to create principal components. First we highlight the imputed data, then click Transform button again and select PCA panel. We choose two principal components as following: After the Create Dataset button is clicked, the PAC result is created, including eigen value, eigen vector and principal components. The first two principal components contribute 9.8% of total variation. They can be used as population structure in association analysis. 61

Association analysis using GLM We use following files from the Tutorial dataset to perform association study by using general linear mode (GLM). Phenotype: three_traits_rn_rename.

68 Association analysis using GLM We use following files from the Tutorial dataset to perform association study by using general linear mode (GLM). Phenotype: three_traits_rn_rename.txt Marker: d8coding_rn_rename.txt Population structure: popstructure_taxa286_rn_rename.txt First we extract SNP from the sequence file (d8coding_rn_rename.txt) by highlighting it and then click Sites button on Data panel to show following dialog box. After the Filter button is clicked, a dataset name Point will be added to data tree. 62

69 63

Now we need to join phenotype, population structure and genotype (Point) dataset together. First is to select these data from the data tree then click ANF Join button.

70 Now we need to join phenotype, population structure and genotype (Point) dataset together. First is to select these data from the data tree then click ANF Join button. A dataset is added to data tree with following name. three_traits_rn_rename + popstructure_taxa286_rn_rename + Point Then click on the dataset from the Data Tree panel, followed by the GLM button. The Input Data Definition dialog box will open. Use the drop down lists under Input Data Types to change Q1 and Q2 to Covariate and Q3 to Exclude as pictured in the screen shot below. Then click OK. Note that the Data Header labels have been filled in by the application. Header 1 is always the trait name. Header 2 is labeled Env (environment) and can be changed by the user. Additional headers, if they exist, must be assigned label names by the user. The Build a Linear Model dialog box will appear. Based on the factors chosen in the Data Definition dialog, TASSEL creates a default model, which appears in the Model box. To change which effects are included in the model, use the > or >> buttons to add selected factors or all factors from the Available Factors list. Use the < or << buttons to remove selected factors or all factors from the model. Select factors in the Available Factors list then click * to add an interaction term to the model. For this example, do not make any changes to the default model. Click Define Output. 64

The Define Output dialog box appears. TASSEL provides a default F-test for markers using the model residual as the error. That is appropriate in this case.

71 The Define Output dialog box appears. TASSEL provides a default F-test for markers using the model residual as the error. That is appropriate in this case. Type 1000 in the # permutations column of the Define F-Tests table to run permutations permutations will estimate p-values of 0.05 or higher with reasonable accuracy. For estimating lower p-values use more permutations. Optionally, additional F-tests can be added by selected a hypothesis and error term then clicking Add F-Test. To create a report of marker mean estimates, select Marker in the hypothesis list and click Add LSMean. The dialog box should now appear as shown below. Click Run. The results GLM_Point + 3 traits/environ + 3 element Q and Lsmeans_Marker_Point + 3 traits/environ + 3 element Q will be added to the Data Tree panel under Association. Viewing Results - Table: To view the results of this association analysis, click on the Results button in the upper left of the screen. Select the association result GLM_ three_traits_rn_rename + popstructure_taxa286_rn_rename + Point and then click Table. F_marker is the F-test calculated as MS_marker / MS_error. p-value is probability of a larger F based on the F-distribution. #-permutations is the number of permutations run. p-perm_marker is the permutation based test for marker significance individual markers. padj_marker is the experiment-wise p-value for marker significance that controls the error-rate for the entire set of hypotheses. Rsq Model is the fraction of the total variation explained by the full model. Rsq_Marker is the fraction of the total variation explained by the marker after fitting the other model effects. 65

5.6 Association analysis using MLM Here we illustrate association mapping by MLM with the following model: Trait = marker effect + population structure + polygene effect + residual To begin,

72 5.6 Association analysis using MLM Here we illustrate association mapping by MLM with the following model: Trait = marker effect + population structure + polygene effect + residual To begin, phenotype data, SNP data, population structure and kinship are loaded into TASSEL from the following files: Phenotype: three_traits_rn_rename.txt Marker: d8coding_rn_rename.txt Population structure: popstructure_taxa286_rn_rename.txt Kinship: kinship_277taxa_by_spagedi_rn_sq_rename.txt After extraction of SNP from the sequence file, we join it with phenotype, and population structure dataset. Caution: Do not join kinship. Select the joint dataset and kinship dataset, click MLM button in Analysis mode. The Input data definition dialog box will pop-up. Use the drop down lists under Input Data Types to keep dpoll as Data and change stiff-stalk and nonstiff to Covariate and exclude the others as seen in the screen shot below. Then Click OK. 66

These factors can be displayed or removed from on the right side (factors in the model)

73 After the OK button is clicked, the following dialog box is shown to list all the available factors (on left side). These factors can be displayed or removed from on the right side (factors in the model) by clicking the buttons in the middle. After the Run button is clicked, the Heritability dialog box is shown for user to choose different options. 67

The computation will be finished after a few seconds for this example.

74 In case the Use prior heritability given above is not checked, the following dialog box is shown for defining the parameters used for iterations. The computation will be finished after a few seconds for this example. The computing time manly depends on number of taxas, number of traits and markers. Association mapping results and estimates of variance components will be put on the Datatree shown below in the same order 68

to guide the user through scenarios involving genotype data.

75 The traits with squares mean none converge. 5.7 Working with genotype data For the purposes of this tutorial, we created a demonstration dataset to guide the user through scenarios involving genotype data. The dataset includes SSR genotype data (diploid_ssr.txt), phenotype data (diploid_traits.txt) and population structure data (diploid_stru.txt). These files are loaded into TASSEL as follows: 69

To do this, first click on the Genotype button (Data Genotype) and then click on the Genotype Converter button.

76 The example below demonstrates the application of GLM for association analysis between SSR genotypes and phenotypes. The first step involves using the Genotype Converter to transfer genotypes from raw form into genotypic state. To do this, first click on the Genotype button (Data Genotype) and then click on the Genotype Converter button. In the Genotype Converter window, activate the Create alignment based on genotypic state (eg. A:a>Aa) radio button as follows: Clicking OK will add the dataset GenoStates to the Data Tree panel. 70

button in the Options panel. Then click on the GLM button to show the following dialog box.

77 Now GenoStates can be joined with phenotype and structure datasets. To continue with the analysis, switch from Data mode into Analysis mode by clicking on the Analysis button in the Options panel. Then click on the GLM button to show the following dialog box. In the Input Data Definition window, use the drop down boxes to include all the phenotypic traits, take Q1 and Q2 as covariates and exclude Q3. 71

78 Then click the OK button to define the model as follows: Click the Run button to add the association results to the Data Tree. 72

5.8 Data visualization TASSEL is equipped with many visualization tools so that the user can better interpret raw data, manipulated data and analysis results. 5.8.1 2D Plot To create a 2D Plot, first click on the Results button in the Options panel.

In the 2-D chart dialogue box that opens, use the drop down lists to select sources for Row, Column and Value. Then activate the check boxes next to Row, Column and Value.

79 5.8 Data visualization TASSEL is equipped with many visualization tools so that the user can better interpret raw data, manipulated data and analysis results D Plot To create a 2D Plot, first click on the Results button in the Options panel. Select LogisticRat_5T_18 under Association from the Data Tree panel and then click on the 2D plot button. In the 2-D chart dialogue box that opens, use the drop down lists to select sources for Row, Column and Value. Then activate the check boxes next to Row, Column and Value. This example used Row-Environment, Column-Site, and Value-PermuteP to get the graph below: XY Plot To create an XY plot, click on the Analysis button in the upper left of the screen. Select d8coding_rn_rename (raw gene) in the Data Tree panel and then click on the Diversity button. In 73

To visualize these results in an XY plot, click on the Results button in the Options panel.

80 the Diversity Surveys dialog box, activate the Sliding Window checkbox. Clicking the Run button will add two datasets to the Data Tree panel under Diversity. To visualize these results in an XY plot, click on the Results button in the Options panel. Select Diversity: d8coding_rn_rename from the Data Tree panel and then click on the Chart button. Using the Graph Type drop down list, select XY Plot. Then select Pi for the Y1 axis, Theta for the Y2 axis, and Site for the X axis. A Site vs. Pi and Theta graph will be generated as shown below. 74

5.8.3 LD Plot To create an LD plot, first click on the Analysis button in the Option panel. Select Point from the Data Tree panel (under Genes ) and then click the Link. Diseq button.

81 5.8.3 LD Plot To create an LD plot, first click on the Analysis button in the Option panel. Select Point from the Data Tree panel (under Genes ) and then click the Link. Diseq button. Click Run in the dialog box that appears. The result LD: Point will be added to the Data Tree panel under LD. To visualize these results using a LD plot, click on the Results button in the Options panel. Select LD: Point from the Data Tree panel and then click on the LD Plot button. The following graph is displayed. 75

Genetic Analysis. Page 1

Genetic Analysis. Page 1 Genetic Analysis Page 1 Genetic Analysis Objectives: 1) Set up Case-Control Association analysis and the Basic Genetics Workflow 2) Use JMP tools to interact with and explore results 3) Learn advanced