LD vignette Measures of linkage disequilibrium

Similar documents
Genetic Analysis. Page 1

Spotter Documentation Version 0.5, Released 4/12/2010

Estimation of haplotypes

The LDheatmap Package

Step-by-Step Guide to Basic Genetic Analysis

Data Science and Statistics in Research: unlocking the power of your data Session 3.4: Clustering

Chapter 6: Cluster Analysis

UNSUPERVISED LEARNING IN R. Introduction to hierarchical clustering

Network Based Models For Analysis of SNPs Yalta Opt

Data input vignette Reading genotype data in snpstats

Package FunciSNP. November 16, 2018

Genetic type 1 Error Calculator (GEC)

Package NetCluster. R topics documented: February 19, Type Package Version 0.2 Date Title Clustering for networks

Research Question Presentation on the Edge Clique Covers of a Complete Multipartite Graph. Nechama Florans. Mentor: Dr. Boram Park

Step-by-Step Guide to Relatedness and Association Mapping Contents

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview

Section 2.2 Graphs of Linear Functions

10701 Machine Learning. Clustering

points are stationed arbitrarily close to one corner of the square ( n+1

BEAGLECALL 1.0. Brian L. Browning Department of Medicine Division of Medical Genetics University of Washington. 15 November 2010

Release Notes. JMP Genomics. Version 4.0

ELAI user manual. Yongtao Guan Baylor College of Medicine. Version June Copyright 2. 3 A simple example 2

Computing with large data sets

Package asymld. August 29, 2016

1. Carlos A. Felippa, Introduction to Finite Element Methods,

Package optimus. March 24, 2017

Chapter 12: Quadratic and Cubic Graphs

Grade 8: Content and Reporting Targets

6.801/866. Segmentation and Line Fitting. T. Darrell

Glossary Common Core Curriculum Maps Math/Grade 6 Grade 8

UNIBALANCE Users Manual. Marcin Macutkiewicz and Roger M. Cooke

Motivation. Technical Background

Space Filling Curves and Hierarchical Basis. Klaus Speer

= 2x c 2 x 2cx +6x+9 = c 2 x. Thus we want (2c +6)x 2 +9x=c 2 x. 2c+6=0and9 c 2 =0. c= 3

2. Find the smallest element of the dissimilarity matrix. If this is D lm then fuse groups l and m.

Package HMMASE. February 4, HMMASE R package

Preliminary Figures for Renormalizing Illumina SNP Cell Line Data

An Introduction to Graph Theory

Algorithms and Complexity

Analyzing Genomic Data with NOJAH

Sparse matrices, graphs, and tree elimination

Didacticiel Études de cas

Hierarchical clustering

1. Basic Steps for Data Analysis Data Editor. 2.4.To create a new SPSS file

Quality control of array genotyping data with argyle Andrew P Morgan

Fathom Dynamic Data TM Version 2 Specifications

For our example, we will look at the following factors and factor levels.

School District of Marshfield Mathematics Standards

MSA220 - Statistical Learning for Big Data

Supplementary text S6 Comparison studies on simulated data

Package SimGbyE. July 20, 2009

8. MINITAB COMMANDS WEEK-BY-WEEK

Unsupervised Learning and Clustering

PRSice: Polygenic Risk Score software - Vignette

Estimating. Local Ancestry in admixed Populations (LAMP)

2 + (-2) = 0. Hinojosa 7 th. Math Vocabulary Words. Unit 1. Word Definition Picture. The opposite of a number. Additive Inverse

Monte Carlo Integration

Y7 Learning Stage 1. Y7 Learning Stage 2. Y7 Learning Stage 3

An introduction to interpolation and splines

Visual Analytics. Visualizing multivariate data:

Exploratory data analysis for microarrays

Situation 34: Mean Median Prepared at Penn State Mid-Atlantic Center for Mathematics Teaching and Learning June 29, 2005 Sue, Evan, Donna

Affymetrix GeneChip DNA Analysis Software

Iterative Learning of Single Individual Haplotypes from High-Throughput DNA Sequencing Data

How to use FSBforecast Excel add in for regression analysis

Geometry. Cluster: Experiment with transformations in the plane. G.CO.1 G.CO.2. Common Core Institute

Lecture 25: Review I

Project 11 Graphs (Using MS Excel Version )

6 Randomized rounding of semidefinite programs

Step-by-Step Guide to Advanced Genetic Analysis

COMMUNITY UNIT SCHOOL DISTRICT 200

BICF Nano Course: GWAS GWAS Workflow Development using PLINK. Julia Kozlitina April 28, 2017

implementing the breadth-first search algorithm implementing the depth-first search algorithm

Iterative Methods for Solving Linear Problems

Package bnbc. R topics documented: April 1, Version 1.4.0

Exploring Fractals through Geometry and Algebra. Kelly Deckelman Ben Eggleston Laura Mckenzie Patricia Parker-Davis Deanna Voss

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions

Transforming Objects in Inkscape Transform Menu. Move

Data Exploration with PCA and Unsupervised Learning with Clustering Paul Rodriguez, PhD PACE SDSC

An Approach to Identify the Number of Clusters

Light and refractive index

On Rainbow Cycles in Edge Colored Complete Graphs. S. Akbari, O. Etesami, H. Mahini, M. Mahmoody. Abstract

16.410/413 Principles of Autonomy and Decision Making

Math 182. Assignment #4: Least Squares

Common Core Vocabulary and Representations

Organisation and Presentation of Data in Medical Research Dr K Saji.MD(Hom)

Linear Methods for Regression and Shrinkage Methods

CS Data Structures and Algorithm Analysis

Exercise: Graphing and Least Squares Fitting in Quattro Pro

Linkage Disequilibrium Map by Unidimensional Nonnegative Scaling

2. (a) Explain when the Quick sort is preferred to merge sort and vice-versa.

Week 7 Picturing Network. Vahe and Bethany

Hierarchical Clustering / Dendrograms

Using Hamming Distance as Information for Clustering SNP. Sets and Testing for Disease Association

5 Mathematics Curriculum. Module Overview... i. Topic A: Concepts of Volume... 5.A.1

Click on "+" button Select your VCF data files (see #Input Formats->1 above) Remove file from files list:

TELCOM2125: Network Science and Analysis

Exploratory Data Analysis

Integers & Absolute Value Properties of Addition Add Integers Subtract Integers. Add & Subtract Like Fractions Add & Subtract Unlike Fractions

Multivariate analyses in ecology. Cluster (part 2) Ordination (part 1 & 2)

Transcription:

LD vignette Measures of linkage disequilibrium David Clayton June 13, 2018 Calculating linkage disequilibrium statistics We shall first load some illustrative data. > data(ld.example) The data are drawn from the International HapMap Project and concern 603 SNPs over a 1mb region of chromosome 22 in sample of Europeans (ceph.1mb) and a sample of Africans (yri.1mb): > ceph.1mb A SnpMatrix with 90 rows and 603 columns Row names: NA06985... NA12892 Col names: rs5993821... rs5747302 > yri.1mb A SnpMatrix with 90 rows and 603 columns Row names: NA18500... NA19240 Col names: rs5993821... rs5747302 The details of these SNP are stored in the dataframe support.ld: > head(support.ld) dbsnpalleles Assignment Chromosome Position Strand rs5993821 G/T G/T chr22 15516658 + rs5993848 C/G C/G chr22 15529033 + rs361944 C/G C/G chr22 15544372 + rs361995 C/T C/T chr22 15544478 + rs361799 C/T C/T chr22 15544773 + rs361973 A/G A/G chr22 15549522 + 1

The function for calculating measures of linkage disequilibrium (LD) in snpstats is ld. The following two commands call this function to calculate the D-prime and R-squared measures of LD between pairs of SNPs for the European and African samples: > ld.ceph <- ld(ceph.1mb, stats=c("d.prime", "R.squared"), depth=100) > ld.yri <- ld(yri.1mb, stats=c("d.prime", "R.squared"), depth=100) The argument depth specifies the maximum separation between pairs of SNPs to be considered, so that depth=1 would have specified calculation of LD only between immediately adjacent SNPs. Both ld.ceph and ld.yri are lists with two elements each, named D.prime and R.squared. These elements are (upper triangular) band matrices, stored in a packed form defined in the Matrix package. They are too large to be listed, but the Matrix package provides an image method, a convenient way to examine patterns in the matrices. You should look at these carefully and note any differences. > image(ld.ceph$d.prime, lwd=0) 2

100 200 Row 300 400 500 > image(ld.yri$d.prime, lwd=0) 100 200 300 400 500 Column Dimensions: 603 x 603 3

100 200 Row 300 400 500 The important things to note are 100 200 300 400 500 Column Dimensions: 603 x 603 1. there are fairly well-defined blocks of LD, and 2. LD is more pronounced in the Europeans than in the Africans. The second point is demonstrated by extracting the D-prime values from the matrices (they are to be found in a slot named x) and calculating quartiles of their distribution: > quantile(ld.ceph$d.prime@x, na.rm=true) 0% 25% 50% 75% 100% 0.0000000 0.1284448 0.2966491 0.6373796 1.0000000 4

> quantile(ld.yri$d.prime@x, na.rm=true) 0% 25% 50% 75% 100% 0.0000000 0.1066341 0.2438494 0.5098348 1.0000000 If preferred, image can produce colour plots. We first create a set of 10 colours ranging from yellow (for low values) to red (for high values) > spectrum <- rainbow(10, start=0, end=1/6)[10:1] and plot the image, with a colour key down its right hand side > image(ld.ceph$d.prime, lwd=0, cuts=9, col.regions=spectrum, colorkey=true) 1.0 100 0.8 200 0.6 Row 300 0.4 400 0.2 500 0.0 100 200 300 400 500 Column Dimensions: 603 x 603 5

The R-squared matrices provide similar pictures, although they are rather less regular. To show this clearly, we focus on the 200 SNPs starting from the 75-th, using the European data > use <- 75:274 > image(ld.ceph$d.prime[use,use], lwd=0) 50 Row 100 150 50 100 150 Column Dimensions: 200 x 200 > image(ld.ceph$r.squared[use,use], lwd=0) 6

50 Row 100 150 50 100 150 Column Dimensions: 200 x 200 The R-squared values are smaller and there are holes in the LD blocks; SNPs within an LD block do not necessarily have large R-squared between them. This is further demonstrated in the next section. D-prime, R-squared, and distance To examine the relationship between LD and physical distance, we first need to construct a similar matrix holding the physical distances. This is carried out, by first calculating each off-diagonal, and then combining them into a band matrix 7

> pos <- support.ld$position > diags <- vector("list", 100) > for (i in 1:100) diags[[i]] <- pos[(i+1):603] - pos[1:(603-i)] > dist <- bandsparse(603, k=1:100, diagonals=diags) The values in the body of the band matrix are contained in a slot named x, so the following commands extract the physical distances and the corresponding LD statistics for the Europeans: > distance <- dist@x > D.prime <- ld.ceph$d.prime@x > R.squared <- ld.ceph$r.squared@x These are very long vectors so we use the hexbin package to produce abreviated plots. We first demonstrate the relationship between D-prime and R-squared > plot(hexbin(d.prime^2, R.squared)) 8

R.squared 1 0.8 0.6 0.4 0.2 0 Counts 13643 12790 11938 11085 10232 9380 8527 7675 6822 5969 5117 4264 3412 2559 1706 854 1 0 0.2 0.4 0.6 0.8 1 D.prime^2 We see that the square of D-prime provides an upper bound for R-squared; a high D-prime indicates the potential for two SNPs to be highly correlated, but they need not be. The following commands examine the relationship between the two LD measures and physical distance > plot(hexbin(distance, D.prime, xbin=10)) 9

D.prime 1 0.8 0.6 0.4 0.2 0 Counts 1900 1781 1663 1544 1425 1307 1188 1069 950 832 713 594 476 357 238 120 1 0 50000 1e+05 150000 2e+05 distance > plot(hexbin(distance, R.squared, xbin=10)) 10

R.squared 1 0.8 0.6 0.4 0.2 0 Counts 5711 5354 4997 4640 4284 3927 3570 3213 2856 2499 2142 1785 1428 1072 715 358 1 0 50000 1e+05 150000 2e+05 distance Although the data are very noisy, the first plot is consistent with an approximately exponential decline in mean D-prime with distance, as predicted by the Malecot model. A view of the calculations To understand the calculations let us consider the first and fifth SNPs in the Europeans. We shall first converting these to character data for legibility, and then tabulate the two-snp genotypes, saving the 3 3 table of genotype frequencies as tab33: 11

> snp1 <- as(ceph.1mb[,1], "character") > snp5 <- as(ceph.1mb[,5], "character") > tab33 <- table(snp1, snp5) > tab33 snp5 snp1 A/A A/B B/B A/A 6 21 17 A/B 2 18 17 B/B 0 2 7 These two SNPs have a moderately high D-prime, but a very low R-squared: > ld.ceph$d.prime[1,5] [1] 0.570161 > ld.ceph$r.squared[1,5] [1] 0.06628534 The LD measures cannot be directly calculated from the 3 3 table above, but from a 2 2 table of haplotype frequencies. In only eight cells around the periphery of the table we can unambiguously count haplotypes and these give us the following table of haplotype frequencies: rs361799 rs5993821 A B A 35 72 B 4 33 However, in the central cell of the 3 3 table (i.e. tab33[2,2]) we have 18 doubly heterozygous subjects, whose genotype could correspond either to the pair of haplotypes A-A/B-B or to the pair of haplotypes A-B/B-A. These are said to have unknown phase. The expected split between these possible phases is determined by a further measure of LD the odds ratio. If the odds ratio is θ, we expect a proportion θ/(1 + θ) of the doubly heterozygous subjects to be A-A/B-B, and a proportion 1/(1 + θ) to be A-B/B-A. We next use ld to obtain an estimate of this odds ratio 1 and, using this, we partition the doubly heterozygous individuals between the two possible phases: > OR <- ld(ceph.1mb[,1], ceph.1mb[,5], stats="or") > OR 1 Here ld is called with two arguments of class SnpStats and, since only the odds ratio is to be calculated, it returns the odds ratio rather than a list. 12

rs361799 rs5993821 4.163 > AABB <- tab33[2,2]*or/(1+or) > ABBA <- tab33[2,2]*1/(1+or) > AABB rs361799 rs5993821 14.51 > ABBA rs361799 rs5993821 3.486 We are now able to construct the table of haplotype frequencies: rs361799 rs5993821 A B A 49.51 75.49 B 7.486 47.51 It is easy to confirm that the odds ratio in this table, (49.51 47.51)/(75.49 7.486), corresponds closely with that given by the ld function. Having obtained the 2 2 table of haplotype frequencies, any LD statistic may be calculated. Of course, there is a circularity here; we needed to know the odds ratio in order to be able to construct the 2 2 table from which it is calculated! That is why these calculations are not simple. The usual method involves iterative solution using an EM algorithm: an initial guess at the odds ratio is used, as in the calculations above, to compute a new estimate, and these calculations are repeated until the estimate stabilizes. However, in snpstats the estimate is calculated in one step, by solving a cubic equation. The extent of LD around a point Often we wish to guage how far LD extends from a given point (for example, from a SNP which is associated with disease incidence). For illustrative purposes we shall consider the region surroung the 168-th SNP, rs2385786. We first calculate D-prime values for the 100 SNPs on either side of rs2385786, and their positions: > lr100 <- c(68:167, 169:268) > D.prime <- ld(ceph.1mb[,168], ceph.1mb[,lr100], stats="d.prime") > where <- pos[lr100] We now plot D.prime against position, adding a simple smoother: 13

> plot(where, D.prime) > lines(where, smooth(d.prime)) 15700000 15800000 15900000 16000000 0.0 0.2 0.4 0.6 0.8 1.0 where D.prime Although the data are somewhat noisy (the sample size is small), the region of LD is fairly clearly delineated. Selecting tag SNPs Several ways have been suggested to select a set of tag SNPs which can be used to test for associations in a given region. That described below is based upon a heirarchical cluster 14

analysis. We shall apply it to the region of high LD identified in the previous section, which lies between positions 1.579 10 7 and 1.587 10 7. The following commands identify which SNPs lie in this region, and extracts the relevant part of the R 2 matrix, as a symmetric matrix rather than as an upper triangular matrix. > use <- pos>=1.579e7 & pos<=1.587e7 > r2 <- forcesymmetric(ld.ceph$r.squared[use, use]) The next step is to convert (1 R 2 ) into a distance matrix, stored as required for the hierachical clustering function hclust, and to carry out a complete linkage cluster analysis > D <- as.dist(1-r2) > hc <- hclust(d, method="complete") To plot the dendrogram, we must first adjust the character size for legibility: > par(cex=0.5) > plot(hc) 15

Cluster Dendrogram Height 0.0 0.2 0.4 0.6 0.8 1.0 rs11913227 rs16981924 rs4819934 rs4819936 rs2385785 rs5748748 rs5748749 rs11704537 rs17733785 rs11914017 rs5748756 rs2215725 rs1981707 rs5748761 rs9618953 rs5992601 rs5992607 rs1981708 rs7510758 rs5994093 rs5994095 rs5992589 rs9306242 rs5748762 rs9618954 rs5748759 rs9618948 rs7287116 rs5748773 rs8137298 rs5992598 rs5748771 rs5748766 rs11089388 rs2399152 rs713673 rs5994098 rs933461 rs5748752 rs5748755 rs5748747 rs5748750 rs1024731 rs5748765 rs12167026 rs9618937 rs5994104 rs5992604 rs5748798 rs4819545 rs9606559 rs2041607 rs3948637 rs1024732 rs4819923 rs1990483 rs5994110 rs5748744 rs11089387 rs5992600 rs5994097 rs2385786 rs5992593 rs5748769 rs7291404 rs5992590 D hclust (*, "complete") The interpretation of this dendrogram is that, if we were to draw a horizontal line at a height of 0.5, then this would divide the SNPs into clusters in such a way that the value of (1 R 2 ) between any pair of SNPs in a cluster would be no more than 0.5 (so that R 2 would be at least 0.5). This can be carried out using the cutree function, which returns the cluster membership of each SNP: > clusters <- cutree(hc, h=0.5) > head(clusters) rs5994093 rs5748744 rs5992589 rs9306242 rs5994095 rs5992590 1 1 1 1 1 2 16

> table(clusters) clusters 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 5 1 7 1 3 6 2 2 6 1 9 12 3 1 1 1 4 1 It can be seen that there are 18 clusters. To have a reasonable chance of picking up an association with the SNPs in this 80kb region, we would need to type a SNP from each one of these clusters. Of these, 7 SNPs would only tag themselves! A threshold R 2 of 0.5 might seem rather low. However, this is a worst case figure and most values of R 2 would be substantially better than this, particularly if an effort is made to choose tag SNPs which are in the center of clusters rather than on their edges. Also, this process has only considered tagging by single SNPs; it can be that two or more tag SNPs, taken together, can provide substantially better prediction than any one of them alone. 17