Step-by-Step Guide to Advanced Genetic Analysis


Damian Fletcher
6 years ago

Introduction

In the previous document, the Step-by-Step Guide to Basic Genetic Analysis, we covered the standard genetic analyses available in JMP Genomics. You should review that guide before working through the examples described here. This document covers the more advanced options available in the software. These analytical processes are not necessarily more difficult to use; however, they are more specialized and may require a greater understanding of the underlying statistics than the more commonly used processes do. We do not review statistical theory in this document; the literature citations available in the online User Guide do this quite well. The goal is rather to explain how to use the processes described below and how to interpret their results.

Objectives

In this document we will cover the following processes:

- Recode Genotypes, which recodes allelic/genotypic values to a variety of formats.
- Relationship Matrix, for calculating identity by state.
- PCA for Population Stratification, which uses the Eigenstrat method to describe and account for population structure in marker data.
- Multiple SNP-Trait Association, which uses a gene or other organizational level to analyze multiple SNPs.
- Pleiotropic Association, for MANOVA analyses.
- Rare-Variant Analysis, which accommodates a number of rare-variant analytic approaches.

Recode Genotypes

The Recode Genotypes process changes allele/genotype formats to numeric formats, or numeric formats to the A/B-style format. In some instances, having data in the numeric format can speed processing of large data sets, and it is required by a number of processes in this training module. Note that genotypes/alleles can be recoded into the default Numeric Additive format as part of the Marker Properties process, as described in the Basic Genetic Analysis document. If you need data in other numeric formats for further analysis, we recommend that you recode the data using the Recode Genotypes analytical process (AP).

1) Select Genomics > Genetics Utilities > Recode Genotypes from the Genomics Starter menu.

2) Choose the ord1_geno_data_sr.sas7bdat data set, generated as described in the Basic Genetic Analysis module, as the Input SAS Data Set.
3) Select the Marker Variables as done previously. In this example, type rs: seq: in the List-Style Specification of Marker Variables text box.
4) Choose the Output Folder. The completed General tab is shown below:
5) Select the Recode tab.
6) Complete the Recode tab as shown below:

Clicking the ? next to the Genotype Recoding section brings up the recoding scheme for each option. We will add r_ to the start of each recoded genotype column name, and a new, corresponding column will be added to the annotation file. We are optionally naming the output data set rec_num_data.

7) Select the Annotation tab.
8) Choose the ord1_geno_data_hwe_sr.sas7bdat data set as the Annotation SAS Data Set.
9) Select ID as the Annotation Label Variable and MajorAllele as the Annotation Major Allele Variable. Identifying the major allele variable, calculated earlier with Marker Properties, will speed up this process. Keep in mind that the selection of a major allele using Marker Properties is data-set dependent.
10) Type rec_anno for the Output Annotation Data Set.
11) Click Run to start the process.
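To make the idea of recoding concrete, the following Python sketch shows one plausible additive coding outside of JMP Genomics. It is an illustration only: it assumes genotypes are stored as strings such as "A/G", that the additive code counts the alleles that differ from the major allele, and that missing calls appear as empty strings, "?" or ".". The function names and the marker name rs1 are hypothetical; the actual recoding schemes are the ones displayed by the ? button in the Recode tab.

```python
def recode_additive(genotype, major_allele, sep="/"):
    """Count the non-major alleles in a genotype string such as "A/G".

    Returns None for missing or malformed genotypes (an assumption of
    this sketch, not a documented JMP Genomics behavior).
    """
    if not genotype:
        return None
    alleles = genotype.split(sep)
    if len(alleles) != 2 or any(a in ("", "?", ".") for a in alleles):
        return None
    return sum(1 for a in alleles if a != major_allele)


def recode_columns(rows, major_alleles, prefix="r_"):
    """Add a recoded column (prefix + marker name) for each marker in each row."""
    recoded = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for marker, major in major_alleles.items():
            new_row[prefix + marker] = recode_additive(row.get(marker), major)
        recoded.append(new_row)
    return recoded


# Hypothetical data: three samples typed at one marker whose major allele is A.
rows = [
    {"sample": "S1", "rs1": "A/A"},
    {"sample": "S2", "rs1": "A/G"},
    {"sample": "S3", "rs1": "G/G"},
]
recoded_rows = recode_columns(rows, {"rs1": "A"})
```

Here the new column r_rs1 holds 0, 1 and 2 for the three samples, mirroring the r_ prefix chosen in the Recode tab.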

Two data sets, the rec_num_data.sas7bdat output data set and the rec_anno.sas7bdat annotation data set, are generated. We use these data sets in the next process.

Relationship Matrix

The Relationship Matrix AP allows you to perform Identity by State (IBS), Identity by Descent (IBD) and Allele Sharing calculations. These data are from unrelated individuals, so we will use the IBS option.

1) Select Genetics > Relatedness Measures > Relationship Matrix from the Genomics Starter menu.
2) Choose the rec_num_data.sas7bdat data set as the Input SAS Data Set.
3) Complete the General tab as shown below:
4) Select the Annotation tab.
5) Choose the rec_anno.sas7bdat data set as the Annotation SAS Data Set as shown below:

6) Select the Analysis tab.
7) Complete the Analysis tab as shown below:

Select Identity by State as the Relationship Matrix to Compute. The Compute the Root of the Matrix option generates a population structure data set for later use in Q-K mixed model association analysis. The Report Sample Pairs options generate a table and corresponding graph from pairs of individuals with relatedness scores above the specified threshold. Principal Components Analysis will be performed on the relatedness matrix values. This can be useful for finding patterns or groups of related individuals.

8) Select the Options tab.
9) Check the Plot Relationship Matrix Heat Map check box.
10) Click Run to start the process. The results are shown below:

The heat map shows the hierarchical clustering relationship between all of the samples. It is a matrix with the same samples on both the x- and y-axes. The dark line running diagonally from the upper left to the lower right marks the intersection of each sample with itself, where the IBS values are all equal to 1. Samples cluster based on the similarity of the IBS or distance score. In this instance, a group of samples on the lower right forms a distinct cluster separated from the remainder of the samples. There is no strong relationship in these samples, as the IBS value is relatively low. Clustering algorithms will always create clusters, but the significance of observed clusters depends on the strength of the relationship and on knowledge about the samples included in the experiment.

The IBS Pairs tab shows the distribution of the IBS values, and the underlying table (found by selecting View Data under the IBS Pairs pull-down menu at the upper left of the dashboard) shows the sample pairs within the distribution. As this data set includes a limited number of markers, the interpretation of these results is difficult.
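For readers who want an intuition for the values in the heat map, here is a minimal Python sketch of one common IBS definition for genotypes coded as 0/1/2 allele counts: at each marker two samples share 1 - |g1 - g2| / 2 of their alleles, and the score is the mean over markers typed in both samples. This definition, the use of None for missing calls, and the function names are assumptions of the sketch; the formula used by the Relationship Matrix AP may differ.

```python
def ibs_pair(g1, g2):
    """Mean proportion of alleles shared by state between two samples.

    g1 and g2 are equal-length lists of 0/1/2 allele counts; None marks
    a missing call and that marker is skipped.
    """
    shared = 0.0
    used = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        shared += 1.0 - abs(a - b) / 2.0
        used += 1
    return shared / used if used else float("nan")


def ibs_matrix(samples):
    """Symmetric sample-by-sample matrix of IBS scores."""
    n = len(samples)
    return [[ibs_pair(samples[i], samples[j]) for j in range(n)] for i in range(n)]


def pairs_above(matrix, threshold):
    """List (i, j, score) for distinct sample pairs scoring above threshold."""
    n = len(matrix)
    return [
        (i, j, matrix[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if matrix[i][j] > threshold
    ]


# Hypothetical genotypes for three samples at four markers.
samples = [
    [0, 1, 2, 0],
    [0, 1, 2, 0],  # identical to the first sample
    [2, 1, 0, 2],
]
matrix = ibs_matrix(samples)
close_pairs = pairs_above(matrix, 0.9)
```

With complete data, every diagonal entry equals 1, which corresponds to the dark diagonal line in the heat map, and pairs_above plays the role of a relatedness threshold for reporting sample pairs.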

In a GWAS data set, IBS values >0.98 often indicate twins or repeated measurements from the same sample.

11) Select the PCA 2D Row Scores tab.

Each point represents a sample. Each scatterplot sits at the intersection of two distribution plots and shows the samples on those two principal components. There are no obvious clusters in these data, and the distribution of points in the three principal components does not appear to be heavily biased or bimodal.

PCA for Population Stratification

In a genetic data set, there is often unknown structure in the population. PCA for Population Stratification (Eigenstrat, Price et al., 2006) attempts to derive and correct for population structure through the use of principal components analysis.

1) Select Genetics > GWAS Testing > PCA for Population Stratification from the Genomics Starter menu.
2) Choose the rec_num_data.sas7bdat data set as the Input SAS Data Set.
3) Select PCT_CHG_APOC3 as the Trait Variable.
4) Complete the General tab as shown below:

5) Select the Annotation tab.
6) Choose the rec_anno.sas7bdat data set as the Annotation SAS Data Set.
7) Select ID as the Annotation Label Variable.
8) Select the Options tab.
9) Complete the Options tab as shown below:

The PCA Data Set field is used when PCA analysis has been previously run on the data. This can also be useful when changing the number of principal components used in the process. The Maximum Number of Principal Components and Cumulative Proportion values limit how many components are used in the correction. When either of these values is reached, no more principal components are added to the calculation. Create Merged Output PCA Data Set is useful when you want to use principal components as covariates in a separate analysis. Eigencorr Options can be set to use statistical tests to choose significant principal components for the adjustment.

12) Click Run to start the process.

The results are identical to those previously shown for the SNP-Trait Association AP, with the following exception: a new Action Button, Plot Trait by Genotype, appears on the dashboard. Note: If numeric markers are run in SNP-Trait Association, this button also appears in those results.

13) Select a few points from the Volcano Plot, then click the Plot Trait by Genotype action button.

Each point represents a sample; the numeric genotype is on the x-axis and the continuous trait value is on the y-axis.

Multiple SNP-Trait Association

The Multiple SNP-Trait Association process tests the association between a group of SNPs and a trait. You have great flexibility when specifying sets of SNPs for this process. Some common grouping choices are genes, LD blocks and pathways. The setup is identical to that of the SNP-Trait Association process, except that an additional field has been added to the Annotation tab to designate the grouping variable.
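The role of the grouping variable can be pictured with a short Python sketch. It assumes the annotation data have been read into a list of dictionaries with an ID column for the marker label and a Gene_Symbol column for the group; the function name, the min_size parameter and the gene names are hypothetical and are not part of JMP Genomics.

```python
from collections import defaultdict


def group_markers(annotation, label_key="ID", group_key="Gene_Symbol", min_size=1):
    """Map each group (gene, LD block, pathway, ...) to its marker labels.

    Markers with an empty group value are skipped, and groups with fewer
    than min_size markers are dropped.
    """
    groups = defaultdict(list)
    for record in annotation:
        group = record.get(group_key)
        if group:
            groups[group].append(record[label_key])
    return {g: ids for g, ids in groups.items() if len(ids) >= min_size}


# Hypothetical annotation rows.
annotation = [
    {"ID": "rs1", "Gene_Symbol": "GENE_A"},
    {"ID": "rs2", "Gene_Symbol": "GENE_A"},
    {"ID": "rs3", "Gene_Symbol": "GENE_B"},
    {"ID": "rs4", "Gene_Symbol": ""},
]
all_groups = group_markers(annotation)
multi_snp_groups = group_markers(annotation, min_size=2)
```

Each resulting group, rather than each SNP, then becomes one unit of testing. Setting min_size=2 drops groups that contain a single marker.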

1) Select Genetics > Other Association Testing > Multiple SNP-Trait Association from the Genomics Starter menu.
2) Complete the General tab and the Annotation tab as done for PCA for Population Stratification. Select Gene_Symbol as the Annotation Analysis Group Variable on the Annotation tab.
3) Select the Options tab.
4) Complete the Options tab as shown below:

The majority of these options are specific to the test being run. Exclude Single-SNP Genes should always be selected, as this saves time in the analysis. Note that this option applies to the grouping variable: if the group specified is something other than a gene identifier, single-SNP groups of the specified type (e.g., single-SNP LD blocks or pathways) are excluded.

14) Click Run to start the process.

The output is identical to that of SNP-Trait Association, except that each point in the Manhattan Plot represents a gene, and there is no Volcano Plot option.

Pleiotropic Association

Pleiotropic Association performs a MANOVA (Multivariate Analysis of Variance) test between two or more continuous traits and genetic marker data. The purpose of this process is to test whether two or more traits are linked to the same region of the genome. For example, in selective breeding a desirable phenotype (an increase in milk production) may be correlated with an undesirable effect (susceptibility to infection), and you would like to test whether these two correlated traits are linked genetically. Bear in mind that the more continuous traits are tested, the more difficult it will be, in general, to find significant regions of association.

1) Select Genetics > Other Association Testing > Pleiotropic Association from the Genomics Starter menu.
2) Choose the rec_num_data.sas7bdat data set as the Input SAS Data Set.
3) Select PCT_CHG_APOC3 and BMI as the Trait Variables.
4) Complete the General tab as shown below:
5) Make no changes to the default settings on the Model Variables tab. Note that random effects are not allowed in this model.
6) Select the Annotation tab.
7) Choose the rec_anno.sas7bdat data set as the Annotation SAS Data Set.
8) Select ID as the Annotation Label Variable.
9) Complete the Options tab as shown below:

The MANOVA Statistic options perform significance tests with slightly different assumptions. The Roy Greatest Root is the least stringent test.

10) Click Run to start the process.

The results are similar in format to the other genetics processes we have reviewed. The Manhattan Plot shows the MANOVA test for all traits for the genotype and trend test, and then the individually tested traits for the two association tests. There is a new Action Button, View Venn Diagram of Significant Markers.

11) Click on the Genotype button for the Venn diagram.

Selecting a value with a mouse click (e.g., 10) will highlight the corresponding rows in the data table.

12) Select Tables > Subset from the JMP menu to view the markers associated with this value.

Rare Variant Analyses

Rare variant analyses are generally used only with next-generation sequencing variant call data, as SNP microarrays tend to be composed of markers pre-selected for a relatively high degree of heterozygosity. A number of methods are available in JMP Genomics for rare-variant analyses, and all of them attempt in one way or another to take into account the gene (or other structure) in which the variant exists, rather than testing for association between each individual variant and the trait of interest. In this way, significant association with a phenotype can be attributed to a gene, even in a population with numerous different variant loci within that gene. We will examine only one type of analysis in this exercise, as the setup is generally similar for all variations of the test.

1) Select Genetics > Other Association Testing > Rare Variant Tutorial from the Genomics Starter menu. A list of the available tests pops up. Selecting any one of these will open the appropriate AP with the settings for that test.
2) Select the VT variant threshold method button.
3) Click on Rare Variant Association in the pop-up window.
4) Choose the rec_num_data.sas7bdat data set as the Input SAS Data Set.
5) Select RESP as the Trait Variable.
6) Complete the General tab as shown below:

7) Select the Annotation tab.
8) Choose the rec_anno.sas7bdat data set as the Annotation SAS Data Set.
9) Select ID as the Annotation Label Variable.
10) Select Gene_Symbol as the Annotation Analysis Group.
11) Fill out the Options tab as shown below:

A weight proportional to the inverse of the standard deviation of allele counts is used in the calculation. If other weights were available (e.g., PolyPhen scores), they could have been specified in the Annotation tab. Single-SNP genes are excluded from this analysis.

12) Click Run to start the process.

The output is similar to the previous outputs, except that each point on the Manhattan Plot represents a gene's p-value rather than an individual SNP's p-value.

This completes the advanced genetics module. Most dialogs and output from other Genetics applications will be variations of those covered in the basic and advanced genetics modules.
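As a supplement to the weighting used in the rare-variant example, the Python sketch below shows how an inverse-standard-deviation weight favors rarer variants when allele counts are combined into a per-sample score for one gene. It is a sketch under stated assumptions, not the JMP Genomics implementation: genotypes are 0/1/2 allele counts with no missing values, the population standard deviation is used, monomorphic variants receive a weight of 0, and the function names are hypothetical.

```python
from statistics import pstdev


def inverse_sd_weights(genotypes):
    """Return one weight per variant: 1 / SD of its allele counts.

    genotypes is a list of per-sample lists of 0/1/2 allele counts.
    A variant with no variation (SD of 0) gets a weight of 0.
    """
    n_variants = len(genotypes[0])
    weights = []
    for j in range(n_variants):
        sd = pstdev([row[j] for row in genotypes])
        weights.append(1.0 / sd if sd > 0 else 0.0)
    return weights


def burden_scores(genotypes, weights):
    """Weighted sum of allele counts for each sample across one gene's variants."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


# Hypothetical gene with two variants typed in four samples.
# The first variant is rarer (one carrier) than the second (two carriers).
genotypes = [
    [0, 0],
    [0, 0],
    [0, 1],
    [1, 1],
]
weights = inverse_sd_weights(genotypes)
scores = burden_scores(genotypes, weights)
```

The rarer first variant receives the larger weight, so a carrier of it contributes more to the gene-level score than a carrier of the more common second variant.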
