Genetic Analysis

Objectives:
1) Set up Case-Control Association analysis and the Basic Genetics Workflow
2) Use JMP tools to interact with and explore results
3) Learn advanced genetic analysis techniques

JMP Genomics has a number of tools for the analysis of SNP or other genetic marker data, which we will examine in this chapter. The examples shown here all use a data set collected during a study of diabetes. Ask your instructor where to locate the files dbts_data.sas7bdat and dbts_anno.sas7bdat.

The dbts_data file contains information on 363 individuals: 181 patients with diabetes (cases, coded as 1 in the column named Diabetes) and 182 persons without diabetes (controls, coded as 2). Each row in the data set corresponds to an individual, and each column contains genotype data for one of the 2200 SNPs in the MHC region of chromosome 6. Note that each column contains a genotype with alleles delimited by a slash, rather than a single allele.
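To make the slash-delimited genotype format concrete, here is a minimal Python sketch of how such a column could be split into alleles outside JMP Genomics. The function names and the toy column are hypothetical and are not taken from dbts_data; treating an empty value as missing is also an assumption.

```python
from collections import Counter


def split_genotype(genotype, delimiter="/"):
    """Split a delimited genotype such as 'A/G' into its two alleles."""
    alleles = genotype.split(delimiter)
    if len(alleles) != 2:
        raise ValueError(f"expected two alleles, got {genotype!r}")
    return tuple(alleles)


def allele_counts(genotypes, delimiter="/"):
    """Count alleles over one marker column, skipping missing values."""
    counts = Counter()
    for genotype in genotypes:
        if not genotype:  # assumption: empty string or None means missing
            continue
        counts.update(split_genotype(genotype, delimiter))
    return counts


# Hypothetical toy column for a single SNP.
column = ["A/G", "A/A", "G/G", "", "A/G"]
counts = allele_counts(column)
```

Each individual contributes two alleles, so a column with four non-missing genotypes yields eight allele observations.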

SNP annotation for these 2200 SNP markers is found in the dbts_anno.sas7bdat file. There are seven variables describing the SNPs, including chromosome number and location. Note that annotation files can contain any arbitrary columns of information describing the SNPs. The data file and annotation file follow the JMP Genomics convention that the columns of SNP genotypes in the data file occur in the same order as the rows of SNP annotation in the annotation file.

Case-Control Association

The Case-Control Association process is used for association mapping of a binary trait using genetic marker data. Any binary trait may be used, typically with one value representing cases and another representing controls. The p-values from the association tests are plotted against the physical location of the markers to reveal regions of significance. This method assumes that individuals are unrelated in recent generations.

1) Select Genetics > Association Testing > Case-Control Association from the Genomics Starter.
2) Click the Choose button next to the Input Data Set box, and navigate to the dbts_data.sas7bdat file. A message window will appear, warning that only the first 5000 variable names will be available for point-and-click selection. This means that in large data sets, marker variables must be specified using list-style specification rather than by point-and-click.

3) Choose the Diabetes variable from the Available Variables list, and move it into the Trait Variables box using the arrow button.
4) In the List-Style Specification of Marker Variables box, type "rs:". (Do not include the quotes.) This syntax includes all variables whose names start with the letters rs. Other SAS-style variable lists are valid here as well. Click the ? button for further information.
5) Click the Choose button next to the Output Folder box to specify a location to save the output from this procedure. The completed General tab should look like this:
6) Click on the Annotation tab at the top of the dialog.
7) Click the Choose button next to the Annotation Data Set box and navigate to the dbts_anno.sas7bdat file.
8) Click on the variable named SNP in the Available Variables list. Using the arrow buttons, place it into three of the boxes on the right: Annotation Label Variable, Marker Names Variable, and Accession Number Variable. Filling in the Marker Names Variable box is optional, and consistent use will result in longer run times. Remember that the order of rows in the annotation data set must match the order of marker columns in the data file. To check that these files are ordered consistently, you can specify the name of the variable in the annotation data set that contains the marker column names in the Marker Names Variable box. If a discrepancy is found between the files, an alert message will be displayed.
9) Place the Chr variable into the Annotation Group Variable box.
10) Place the Position variable into the Annotation Location Variable box. The completed Annotation tab should look like this:
11) Click on the Options tab at the top of the dialog.
12) Click on the Genotypes radio button in the Format of Marker Variables section. Genotypes may be specified in different ways. The standard genotype format has the two alleles separated by a delimiter character.
13) In the Genotype Delimiter box, type the forward slash character /.
14) Check the All Markers are Biallelic checkbox to allow the use of a faster computation algorithm.
15) In the Association Tests box, select the Allele, Genotype and Trend tests. Multiple tests may be specified here by pressing and holding either the <Ctrl> or <Shift> key while making selections.
16) Select Calculate allele odds ratios. This will calculate the odds ratio for each biallelic marker. The Allele test must be selected for this option.
17) Type 0 in the Permutations box.

The completed Options tab should look like this:
18) Click on the P-Value Plots tab at the top of the dialog.
19) In the Conversion for P-Values selection box, choose NegLog10.
20) In the Multiple Testing Correction selection box, choose FDR (Benjamini and Hochberg).
21) Type 0.05 in the Alpha box. This combination of choices sets the false discovery rate threshold at 5%.

22) Click the Run button at the bottom left of the dialog to start the process. A plot of the p-values from each test versus the position along chromosome 6 is displayed when the process is complete. The peak in this plot indicates a locus of association for diabetes. The plot and its related data table dbts_data_cca are now available for interactive analysis using JMP tools. A few examples of interactive analysis are now described.

23) Select the highest point (rs9272723), either by clicking on it or drawing a box around it. Click the dbSNP button to launch a web browser and display the dbSNP page for this SNP.
24) Open the data filter dialog from Rows > Data Filter.
25) Select the variable NegLog10_ProbTrend_FDR and click the Add button.
26) Click on the 0 at the left side of the data filter, and type the value 1.3 to change the lower bound. Alternatively, drag the left-hand slider until the lower bound is approximately 1.3. The value 1.3 is the -log10 transformation of the value 0.05. The 394 SNPs satisfying this criterion are significant at a false discovery rate threshold of 5%. If you select the Show box, only those SNPs whose value is greater than 1.3 will show on the plot. If you select any of the database links from the Action Buttons on the dashboard, the information for each selected SNP will open in your web browser. Caution! This process will open an individual web page for each SNP that is selected, whether with the data filter or by drawing a box around the desired SNPs with your mouse.
27) Select Tables > Subset. In the Rows section, choose Selected Rows. In the Columns section, choose All columns. Click OK.

A new table named Subset of dbts_data_cca is displayed, containing the 394 rows that were selected using the Data Filter.
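The filtering above combines two ideas: a Benjamini-Hochberg adjustment of the p-values and a -log10 conversion, so that a 5% false discovery rate corresponds to a cutoff of about 1.3. The following Python sketch illustrates the standard step-up calculation on made-up p-values. It is not the JMP Genomics implementation, and it assumes that a column such as NegLog10_ProbTrend_FDR holds the -log10 of an adjusted p-value.

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def neg_log10(p):
    """Convert a p-value to the -log10 scale used on the plots."""
    return -math.log10(p)


# Made-up raw p-values for five hypothetical markers.
raw = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(raw)
threshold = neg_log10(0.05)  # roughly 1.3
significant = [neg_log10(q) > threshold for q in adjusted]
```

On the -log10 scale larger values mean smaller p-values, which is why the data filter raises the lower bound rather than lowering an upper bound.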

Basic Genetics Workflow

Workflows in JMP Genomics consist of a series of analytic processes (APs) performed in sequence. Results are accessible via links in a JMP journal. The Basic Genetics Workflow is an easy-to-use tool that is primarily intended for users who are new to JMP Genomics. It presents a simplified interface for SNP and sample quality assessment and case-control association analysis, using APs that have been previously described (Marker Properties, Subset and Reorder, Case-Control Association). Many of the choices available in the dialogs for the individual APs have been hidden in the Basic Genetics Workflow to streamline the analysis. After running this workflow with the default options enabled, experienced users may load and modify settings for individual processes within the Basic Genetics Workflow, or design their own custom workflow consisting of any sequence of JMP Genomics APs.

1) Select Workflows > Basic Genetics Workflow from the Genomics Starter.
2) Type Diabetes in the Study Name box.
3) Click the Choose button next to the Input Data Set box to select the diabetes SNP data set dbts_data.sas7bdat.
4) Type rs in the Prefix of Marker Genotype Variables box. Note that since this field is expecting a prefix, you do not need to type the colon after the prefix.
5) Select Diabetes from the list of available variables, and place it into the Binary Trait Variable box using the arrow button.
6) Choose an output folder to store the results from the workflow. The completed General tab should look like this:

7) Click the Annotation tab at the top of the dialog.
8) Using the Choose button, navigate to and select the dbts_anno.sas7bdat annotation data set.
9) From the Available Variables list, choose SNP and place it in three positions using the arrow buttons: Variable Containing Names of Marker Variables, Annotation Label Variable, and Accession Number Variable.
10) Similarly, place the Chr variable into the Annotation Group Variable box and the Position variable into the Annotation Location Variable box. The completed Annotation tab should look like this:

11) On the Subsetting tab, add "2" (with the quotes) to the Trait Value of Individuals to Include in HWE Test field to perform the test for Hardy-Weinberg equilibrium on the control group alone.
12) Make no changes to the defaults on the Filtering tab.
13) Click Run. After the individual APs finish running, a journal containing the results from the workflow is displayed.

This journal is automatically saved in the output folder in the file diabetes.jrn. Opening this journal file in JMP Genomics will allow you to review results for individual processes in the workflow.
14) Click on the Results button for each of the four processes in turn to review each analysis. When you are finished with each process, return to the journal and click the Close All Other Windows button to clear the screen.
15) To modify elements of the workflow, click the Reopen Dialog button under each process. The process window will open and be filled in with the data sets used and the other settings.
16) Click on the Reopen Dialog button under Process 4. The Case-Control Association AP dialog is launched, with the dialog options filled in as they were run in the workflow. Now we will make a change to the analysis, and save it back to the workflow.
17) Click on the Options tab, and in the Association Tests box choose both Genotype and Trend (you can select multiple tests by holding down the Control key while clicking).

18) Click the Save button at the bottom of the Case-Control Association dialog. If necessary, highlight diabetes and click Update Setting Name.
19) Click OK.
20) Close the Case-Control Association dialog, and return to the Workflow Builder. Click Run to re-run the workflow with the new settings.
21) When the workflow is complete, re-open the Case-Control Association results to verify that both the Genotype and Trend tests were run.

SNP-Trait Association

Case-Control Association is limited to analysis of a binary trait without covariates. For more complex models or data structures, there are two general and flexible tools for association analysis: the SNP-Trait Association AP and the Marker-Trait Association AP. As the names imply, SNP-Trait Association is intended for the analysis of biallelic marker data only; Marker-Trait Association can accommodate multi-allelic markers. The interfaces for these two tools are virtually identical. SNP-Trait Association has been optimized for biallelic markers, so SNP analyses will run faster in SNP-Trait than in Marker-Trait.

To demonstrate SNP-Trait Association, we will repeat the association analysis of the diabetes data set, this time adding sex and age as covariates, and an interaction term between sex and each SNP analyzed.

1) Select Genetics > Association Testing > SNP-Trait Association from the Genomics Starter.
2) For the Input Data Set, select the dbts_data_sr2.sas7bdat diabetes marker data file, from which the SNPs failing the quality control screen in the Basic Genetics Workflow were removed. (Look for it in the output folder from the previous analysis.)
3) Select Diabetes from the Available Variables list, and place it into the Trait Variables box using the arrow button.
4) Type rs: into the List-Style Specification of SNP Variables box.
5) Click the Choose button next to the Output Folder box, and designate a folder for the output from this procedure. The completed General tab should look like this:
6) Click on the Model Variables tab.
7) In the Type of Trait selection box, choose Binary. The Diabetes variable has only two levels, case and control, so it is a binary trait. This AP can also accommodate continuous traits, nominal traits (with more than two levels), ordinal traits, and censored survival time traits. If a censored survival time is the trait to be analyzed, specify Survival as the type of trait and place the name of the censoring indicator variable in the Censor Variable box.

We wish to designate sex and age as covariates in this analysis. Any categorical variable that is used as either a fixed or random covariate must be designated as categorical by placing it in the Class Variables box.

8) Choose Sex from the Available Variables list, and place it into the Class Variables box.
9) Type Sex Age in the Fixed Effects box.
10) Type Sex in the Interaction Effects box. This creates an interaction of the effect Sex with the current SNP being analyzed. So, for example, for SNP1, the fixed effects included in the model would be Sex, Age, SNP1, and Sex-by-SNP1. In general, this should not be done when working with a GWAS data set, since a separate model is run with and without the interaction term. The completed Model Variables tab should look like this:
11) Click on the Annotation tab.
12) Click on the Choose button next to the Annotation Data Set box, and navigate to the dbts_anno_sr2.sas7bdat annotation file that corresponds to the dbts_data_sr2 input marker data file. Note: It will be found in the same folder. Remember that the order of rows in the annotation data set must match the order of marker columns in the data file. To check that these files are ordered consistently, you can specify the name of the variable in the annotation data set that contains the marker column names in the Marker Names Variable box. If a discrepancy is found between the files, an alert message will be displayed.
13) Select the proper roles for the annotation variables using the Available Variables list and the arrow buttons, duplicating the selections that were made previously in the Case-Control Association analysis. The completed Annotation tab should look like this:
14) Click on the Options tab.
15) Select Genotypes as the Format of SNP Variables.
16) Type / in the Genotype Delimiter box.
17) Control-click to choose both Genotype and Trend tests for association. The completed Options tab should look like this:

18) Click on the P-Value Plots tab. Make no changes to the default selections.
19) Click Run. The results from the association tests are presented in two plots. The top panel contains results from the genotype main effects, and the bottom panel contains results from the genotype by sex interactions.
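The model terms described in this section (Sex, Age, the current SNP, and Sex-by-SNP) can be pictured as columns of a design matrix. The Python sketch below is only an illustration and is not how JMP Genomics builds its models internally. It assumes an additive 0/1/2 coding of the SNP, as a trend-style test would use, and a hypothetical coding of Sex as F = 1 and M = 0; design_row is a made-up helper name.

```python
def design_row(sex, age, genotype, counted_allele, delimiter="/"):
    """Build one row of predictors: intercept, Sex, Age, SNP, Sex-by-SNP."""
    sex_code = 1 if sex == "F" else 0  # assumed coding: F = 1, M = 0
    # Additive coding: number of copies (0, 1 or 2) of the counted allele.
    snp_dose = genotype.split(delimiter).count(counted_allele)
    interaction = sex_code * snp_dose  # the Sex-by-SNP term
    return [1, sex_code, age, snp_dose, interaction]


# Two hypothetical individuals, counting copies of allele G.
row_female = design_row("F", 54, "A/G", "G")
row_male = design_row("M", 61, "G/G", "G")
```

With this coding the interaction column is zero for every male, so its coefficient measures how much the per-allele effect differs in females.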

Significant main effects indicate association between the marker and the trait; significant interactions indicate that the effect of the SNP genotype on disease status is different for males and for females. Parameter estimates from the genotype and trend models are contained in output data sets, which are available for inspection and further analysis.

Linkage Disequilibrium

Different measures of Linkage Disequilibrium (LD) can be computed for specific markers and/or regions of interest. Because LD is a measure computed between pairs of markers, the number of potential LD measurements in a data set grows rapidly (quadratically) as the number of markers measured increases. It is therefore important to restrict the number of LD calculations made, and a number of methods for doing this will be described. Here, we will focus on a region of association identified in the prior analyses.

1) Select Genetic Marker Statistics > Linkage Disequilibrium from the Genomics Starter.

Once again this example uses the diabetes data set. Use the same marker data set and annotation data set that were selected for the SNP-Trait Association AP previously.
2) The completed General tab should look like this: The filter statement Diabetes = 1 restricts the LD analysis to individuals with diabetes only.
3) The completed Annotation tab should look like this:

The Filter to Include Markers box can be used for restricting the analysis according to values of variables present in the annotation file; here we use it to restrict the LD analysis to markers that fall between positions 32575000 and 32922100, using the le syntax for less than or equal to (synonymous with <=). Other criteria may be added. Click the ? button for more information on allowable syntax.
4) The completed Options tab should look like this:

There are two additional options for controlling which marker pairs are used in LD calculations. The checkbox Perform LD Calculations for All Pairs within Annotation Groups refers to the Annotation Group variable on the previous page. In this case, if the box is checked, LD will be calculated between all marker pairs within chromosomes. If the All Pairs checkbox is unchecked, then LD is calculated between all marker pairs within a sliding positional window. The radio buttons for Distance Unit control whether the size of the sliding window is computed based on the number of adjacent markers, or physical distance as measured by the Location Variable specified on the Annotation tab. Here, we are calculating LD between markers that are no more than 50 markers apart.

Several options for estimation of haplotype frequencies are available; see the documentation for PROC ALLELE in SAS/Genetics for details. Check the Suppress all graphical and HTML output checkbox to improve performance when a large number of LD measures are to be calculated. The large output tables can then be filtered using JMP selection tools to focus the analysis further.
5) The completed Output tab should look like this:

6) Click Run. Several plots are displayed for each annotation group (or for the entire group of markers). Examine the LD Decay tab in the Linkage Disequilibrium Results tabbed report. The LD Decay over Distance plot shows each pair of markers as one point on a scatterplot, with the chosen LD measure on the vertical axis and distance between the pair on the horizontal axis.
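Because every point on this scatterplot is one marker pair, the number of points depends on how pairs are restricted. The Python sketch below compares the all-pairs count with a window restricted to markers at most 50 apart. It uses 2200 markers, the size of the original dbts_data marker set, purely as an illustration; the filtered analysis in this example involves fewer markers. Reading "no more than 50 markers apart" as an index difference of at most 50 is an assumption.

```python
def all_pairs(n_markers):
    """Number of unordered marker pairs: n * (n - 1) / 2."""
    return n_markers * (n_markers - 1) // 2


def window_pairs(n_markers, max_apart):
    """Number of pairs (i, j) with 1 <= j - i <= max_apart."""
    return sum(min(max_apart, n_markers - 1 - i) for i in range(n_markers))


n = 2200  # illustrative marker count only
unrestricted = all_pairs(n)      # 2,418,900 pairs
restricted = window_pairs(n, 50)  # 108,725 pairs
```

The all-pairs count grows quadratically with the number of markers, while the windowed count grows roughly linearly, which is why restricting the window matters for larger regions.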

Examine the All Marker Plots tab. A triangular plot shows p-values from the LD tests, color-coded by their degree of significance. The horizontal and vertical axes represent the positions of markers, as specified in the annotation data set. The points along the diagonal show the results of the Hardy-Weinberg Disequilibrium test on a single marker, while points away from the diagonal represent LD tests between pairs of markers. Below the triangular plot, a contour plot displays regions with similar LD values.
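For readers who want to see what a pairwise LD value is, the following Python sketch computes the coefficient D and the r-squared measure for two biallelic markers from a haplotype frequency and two allele frequencies. These are the textbook definitions; the sketch assumes the haplotype frequency is already known, whereas in practice it must be estimated from unphased genotypes (see PROC ALLELE). The function name and the input values are hypothetical.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, r_squared) for alleles A and B at two biallelic markers.

    p_ab is the frequency of the A-B haplotype; p_a and p_b are the
    frequencies of allele A at marker 1 and allele B at marker 2.
    """
    d = p_ab - p_a * p_b
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("r-squared is undefined for a monomorphic marker")
    return d, d * d / denominator


# Complete LD: allele A always travels with allele B.
d_full, r2_full = ld_measures(0.5, 0.5, 0.5)
# Linkage equilibrium: haplotype frequency equals the product of allele frequencies.
d_none, r2_none = ld_measures(0.25, 0.5, 0.5)
```

An r-squared of 1 means one marker predicts the other perfectly, while 0 means the two markers are statistically independent.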

7) Click on the All Markers button in the Action Buttons section of the tabbed report. A fourth plot is generated, with R² LD measures between pairs of markers displayed according to their position along the chromosome.
8) Click within the plot to define a block of markers.

The Zoom on Selected Block button will allow you to create a subset plot, which can in turn be explored. Use the Retain Selected Block button to define multiple blocks within a plot. The Copy to Journal button will save the plot with any blocks that have been retained to the output journal for later reference.

Haplotype Analysis

Several haplotype analysis tools are available in JMP Genomics. We will work through an example using the Haplotype Estimation and htSNP Selection procedures. Because haplotype analysis is computationally intensive and produces voluminous output, the haplotype procedures should not be run on a whole-genome basis. It is important to reduce the analysis to areas of interest. At most, a single chromosome at a time should be run. In order to estimate haplotypes for a subset of the data, use the Filter to Include Markers setting on the Annotation tab.

1) Select Genetics > Haplotype Analysis > Haplotype Estimation from the Genomics Starter.
2) The completed General tab should look like this:

3) Complete the Annotation tab to match the settings shown in the figure below:

The Filter to Include Markers box is used to restrict the analysis to a smaller region of chromosome 6. The syntax for this filter is specified as 32000000 le Position le 33500000.

If you wish to manually specify groups of SNPs for haplotype estimation, create a categorical variable in the annotation file to place the markers into haplotype groups. This variable can then be specified in the Sliding Window Variable box. The LD Block Creation AP can be used to create these values in the annotation table.

4) Do not change the default settings on the Options tab. Settings for this tab should look like this:

If a haplotype window variable was not specified in the previous tab, use the top three inputs to determine the way haplotypes are calculated. Here, consecutive sets of 5 SNPs along the chromosome will be used to define haplotype windows. Haplotype Sliding Window Size can be defined either by the number of consecutive markers, or by the distance units used in the Annotation Location Variable specified on the previous tab. If a 0 is placed in this box, then haplotypes using all markers within the grouping variable will be estimated. The Window Overlap box controls how many markers are shared between adjacent haplotype windows when the Window Size Unit is set to Markers.
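The interplay of window size and overlap can be sketched in a few lines of Python. This is an illustration of the general sliding-window idea, not the JMP Genomics implementation; in particular, dropping an incomplete window at the end of the chromosome is an assumption, and haplotype_windows is a hypothetical helper name.

```python
def haplotype_windows(markers, size=5, overlap=0):
    """Split an ordered marker list into windows of `size` markers.

    Adjacent windows share `overlap` markers. A trailing group with
    fewer than `size` markers is dropped (an assumption of this sketch).
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be between 0 and size - 1")
    step = size - overlap
    last_start = len(markers) - size
    return [markers[i:i + size] for i in range(0, last_start + 1, step)]


# Twelve hypothetical markers, identified here by their order on the chromosome.
markers = list(range(12))
disjoint = haplotype_windows(markers, size=5, overlap=0)
sliding = haplotype_windows(markers, size=5, overlap=4)
```

With no overlap the twelve markers give two windows; with an overlap of four the window advances one marker at a time and gives eight, so a larger overlap multiplies the number of haplotype estimations.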

For more details about the other options controlling calculation of haplotype frequency estimates, see the JMP Genomics User Guide and the PROC HAPLOTYPE section in the SAS/Genetics documentation.

5) The Association Tests tab should look like this:

The Test Allelic Association (LD) checkbox enables testing for linkage disequilibrium across each haplotype window. Specifying a binary trait variable in the Trait Variable box activates case-control testing for association with that trait over all the haplotypes within each window. Checking the Test Individual Haplotypes box requests individual association tests for each estimated haplotype within each window with the trait. In this case, we will perform a Bonferroni Multiple Testing Correction.

6) The completed Output tab should look like this:

There are two optional output tables that can be requested on this tab. The haplotype frequency table contains a row for each possible haplotype within each window, and its estimated frequency. This table is required when performing follow-up analyses such as haplotype tag SNP selection and haplotype trend regression. The much larger phase assignment table contains a row for each possible pair of parental haplotypes for each individual in each window, along with their probabilities. This table is required input for the haplotype trend regression follow-up analysis. The Frequency Cutoff and Phase Assignment Probability Cutoff options allow the removal of rare or improbable haplotypes or haplotype phase pairs, respectively. Place 0.05 in each of these boxes.

7) Click Run. When the procedure finishes running, it will display the HTML output from the Haplotype procedure in SAS/Genetics, as well as an overlay plot of the significance of the haplotype LD test and the haplotype-trait association test for each window in the analysis.

Points on this graph represent haplotype windows consisting of five adjacent markers each. The blue points show that all the windows are in highly significant linkage disequilibrium, due to the close proximity between adjacent SNPs. The red points show association tests between each window and the binary diabetes status trait, and there is a region of significant association in the center.

8) To further investigate this region, click on the htSNP Selection button under the Launch Follow-Up Process section of the tabbed report. This will launch the htSNP Selection AP and automatically load the relevant data and annotation files.

9) On the General tab of the htSNP Selection dialog, type 55 le window le 55 in the Filter to Include Observations box, to select only the haplotype windows that were located in the central peak of association with diabetes. The Frequency Variable is autoloaded with the H1Freq variable. This is the estimated frequency that was calculated in the previous step of haplotype estimation. For statistical details about the Criterion for Evaluating Sets of htSNPs and the Search Method options, consult the PROC HTSNP section of the SAS/Genetics documentation.
10) Click on the Annotation tab, and view the selected defaults. These do not need to be changed.
11) Click on the Options tab.
12) Set the Subset Size to 2. This option will select 2 of the 5 SNPs in each window to serve as tag SNPs.
13) Set the Number of Selections to Display to 3. This will request the 3 best sets of tag SNPs to be displayed for each window.
14) Select Create a SAS set containing the htSNP indicator variable.

This data set could be used in subsequent analysis. Any tagged SNP will have a value of 1 and untagged SNPs within an analyzed group will have a value of 0. The completed Options tab should look like this:

The SimAnneal Search tab contains options only relevant if Simulated Annealing was chosen as the Search Method on the General tab; if not, the options on this tab are not available for input.

15) Click Run.

The proportion of diversity explained by each set of tag SNPs is plotted by location for each of the chromosomes. Every marker position in a haplotype window is indicated by a vertical gray, dashed line. A different color is used for each haplotype window, and a different symbol is used for each set of tag SNPs in that haplotype window. In this case, we see the three pairs of tag SNPs that had the highest proportion of diversity explained for each haplotype window. The best pair is marked with open circles, and the next best is marked with plus symbols. To see more sets, increase the value of the Number of Selections to Display setting on the Options tab.
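Choosing 2 tag SNPs out of a 5-SNP window means ranking 10 candidate pairs and keeping the best few. The Python sketch below shows that enumeration in its simplest, exhaustive form. The scoring function is a placeholder with made-up integer weights and is not the proportion-of-diversity criterion used by PROC HTSNP; the names best_subsets, toy_weight and toy_score are hypothetical.

```python
from itertools import combinations


def best_subsets(snps, subset_size, score, n_display):
    """Rank every subset of the given size by `score`, highest first."""
    candidates = combinations(snps, subset_size)
    return sorted(candidates, key=score, reverse=True)[:n_display]


# One hypothetical 5-SNP haplotype window.
window = ["snp1", "snp2", "snp3", "snp4", "snp5"]

# Placeholder weights standing in for a real selection criterion.
toy_weight = {"snp1": 1, "snp2": 8, "snp3": 2, "snp4": 16, "snp5": 4}


def toy_score(subset):
    """Placeholder score: sum of the made-up weights in the subset."""
    return sum(toy_weight[snp] for snp in subset)


n_candidates = len(list(combinations(window, 2)))  # 10 possible pairs
top3 = best_subsets(window, 2, toy_score, 3)
```

Exhaustive enumeration is cheap for 5 SNPs, but the number of subsets grows quickly with window size, which is one reason a heuristic search such as simulated annealing is offered as an alternative.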

Conclusion

In this portion of the training, we have walked through some of the common functions available for genetic marker analysis in JMP Genomics. There are many additional APs, including PCA for Population Stratification and Q-K Mixed-Model. The dialogs for these and other genetic marker analysis processes are similar in design to those we have already reviewed here. It is often helpful to read the Description field at the top of each window to learn the specific requirements for each process, and detailed documentation for each process is available by clicking on the User Guide button at the bottom of each process dialog.