GemTools Documentation

Bert Klei and Brian P. Kent
February 2011

Literature

This software is described in:
GemTools: a fast and efficient approach to estimating genetic ancestry (in preparation). Klei L, Kent BP, Melhem N, Devlin B, Roeder K.

The GemTools functions are primarily based on the methods described in:
Discovering genetic ancestry using spectral graph theory. Genet Epidemiol 2010 Jan;34(1):51-9. Lee AB, Luca D, Klei L, Devlin B, Roeder K.

The projection methods are described in:
Using ancestry matching to combine family-based and unrelated samples for genome-wide association studies. Stat Med 2010 Dec 10;29(28): Crossett A, Kent BP, Klei L, Ringquist S, Trucco M, Roeder K, Devlin B.

GEM uses the spectral graph methods described in Lee et al. (2010) to find a low-dimensional representation of the genetic similarities between individuals, which is referred to as an eigenmap. A key feature of the eigenmap is D, the number of eigenvectors required to represent the variability in the data. For instance, separating 3 major ancestry groups usually requires D=2 dimensions, and D=1 models a cline. D=0 suggests that the sample is genetically homogeneous. D is determined using a test of significance (Lee et al. 2010). Assuming an eigenmap is constructed using a representative base sample, additional individuals can be projected onto the map using the Nystrom approximation (Crossett et al. 2010).

Description

GemTools is a package of functions to help the user account for the genetic ancestry of a large number of individuals using spectral graph theory. The package has three components:

1) dacgem
This function organizes a large number of individuals into smaller clusters of individuals with similar genetic ancestry. The approach samples a representative base sample to create an eigenmap of the genotype information. The remaining non-base individuals are then projected into that eigenmap using a Nystrom projection. Working from the base sample, clusters are formed. Non-base individuals are assigned to the cluster of their genetically closest base neighbor.

2) clustergem
After the population is divided into clusters of manageable size, this function further subdivides the dacgem clusters until each subcluster is genetically homogeneous (D=0). For relatively small data sets (<2000 individuals), this function can be used on the original data to generate traditional eigenvectors to account for genetic ancestry.

3) ccmatchgem
This function finds the best matches among cases and controls based on ancestry within the clusters generated by dacgem. Matches are determined with the function fullmatch from the R library optmatch. fullmatch creates strata that include 1 case matched to 1 or more controls, or 1 control matched to 1 or more cases. This function can be used on the subdivided data or on the complete data if it is relatively small in size. Results of this function can be used as strata in conditional logistic regression or other genetic analyses.

In addition to the three main functions, there are 3 additional functions that help the user plot (plotclusterspdf) and save (saveclusterstxt) the results from dacgem and clustergem, as well as save (savematchestxt) the results from ccmatchgem.

Part of the GEM package utilizes the library optmatch. This library should be downloaded from the R repository and loaded (library("optmatch")) before the function ccmatchgem is used.
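The base-sample idea behind dacgem (build an eigenmap from a base set, project the remaining individuals with a Nystrom projection, then attach each projected individual to its closest base neighbor) can be illustrated with a minimal base R sketch. This is not the GemTools implementation: it assumes a plain centered similarity matrix in place of the spectral graph construction of Lee et al. (2010), fixes D=2 instead of testing for significance, and uses simulated genotypes. All variable names and the helper project() are hypothetical.

```r
set.seed(1)
n_base <- 20; n_new <- 5; n_snp <- 50

# Simulated allele counts (0/1/2); real data would come from the 'gnt' matrix.
gnt_base <- matrix(rbinom(n_base * n_snp, 2, 0.3), n_base, n_snp)
gnt_new  <- matrix(rbinom(n_new  * n_snp, 2, 0.3), n_new,  n_snp)

# Center both sets with the SNP means of the base sample.
snp_means <- colMeans(gnt_base)
x_base <- sweep(gnt_base, 2, snp_means)
x_new  <- sweep(gnt_new,  2, snp_means)

# Similarity matrix among base individuals and its leading eigenvectors.
# Assumption: a simple inner-product similarity, not the GEM weighting.
K      <- tcrossprod(x_base)
eig    <- eigen(K, symmetric = TRUE)
D      <- 2                      # fixed here; GEM chooses D by a significance test
U      <- eig$vectors[, 1:D, drop = FALSE]
lambda <- eig$values[1:D]

# Nystrom-style projection: similarities to the base set, times the base
# eigenvectors, with each column divided by its eigenvalue.
project <- function(x) sweep(tcrossprod(x, x_base) %*% U, 2, lambda, "/")
coords_new <- project(x_new)

# Assign each projected individual to its closest base neighbor in the map.
nearest_base <- apply(coords_new, 1, function(p) {
  which.min(colSums((t(U) - p)^2))
})
```

Projecting the base individuals themselves with project(x_base) reproduces their eigenmap coordinates U, which is a quick sanity check on the projection formula.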

dacgem

Usage

    dacgem(gnt, id, n.ind.base = 500, max.ind.cluster = 1000, min.dim = 2, max.dim = 15, method = c("homogeneous", "quick"), verbose = c(TRUE, FALSE))

Arguments

gnt: the genotype matrix. Rows are individuals and columns are SNPs.
id: a vector of unique ID strings for each person in the 'gnt' matrix.
n.ind.base: the number of individuals who are chosen at random to be in the base set. This is also the number used in the base for all levels of subclustering.

max.ind.cluster: the maximum number of individuals allowed in each final cluster. When a cluster still contains more than max.ind.cluster individuals after the initial clustering, that cluster is further broken up into subclusters until each subcluster is small enough.
min.dim: the minimum number of dimensions to be utilized in spectral decompositions.
max.dim: the maximum number of dimensions to be considered significant in spectral decompositions.
method: the desired method of clustering. The "homogeneous" method is the default and creates clusters that have no significant spectral dimensions for the current (sub) data set. The "quick" method creates a number of clusters equal to the number of significant dimensions plus one in the spectral decomposition of the base set.
verbose: toggles the amount of output, both written to the screen and in the values returned. TRUE is the default.

Details

min.dim influences the dimension of the eigenvectors in the elements of frames (the list of data frames with results from each round of clustering). Even if the base set for a particular cluster has D=0, min.dim dimensions will be used in the eigenmap of that cluster. When calculating the genetic distance between individuals, max(D, min.dim) dimensions are used.

The chosen method of clustering is used consistently at each level of the algorithm. When the homogeneous clustering method is applied to a cluster that already has D=0, the algorithm continues to produce subclusters until the number of individuals in each cluster is less than max.ind.cluster. The quick method of clustering splits each cluster into max(D+1, 2) subclusters. When D=0, it splits the cluster into the minimum number of groups necessary for each group to have fewer than max.ind.cluster members.

The quick method is recommended when the homogeneous method creates many small subsets even though there were few significant dimensions.
One reason for many small subsets is the existence of family members in the data, in particular twins, full sibs, and parent-offspring pairs. These should be removed before clustering.

Value

clusters: a vector with final cluster labels (strings) for each individual. The names of the vector are the same as 'id'. The top-level clusters are given by the first digit. Clusters that are broken up further are indicated by a "_" symbol, i.e., 3_1_2 indicates main cluster 3, subcluster 1, and sub-subcluster 2.
frames: only returned if verbose is set to TRUE. A list of data frames with detailed results of each level of clustering and subclustering. In each data frame, the rownames are individual ID strings. Column 1 is "cluster", the cluster labeling for that round of clustering. Column 2 is "is.base", which is 1 if the individual was selected to be in the base set for that round of clustering and 0 if the individual was projected into the eigenmap. The remaining columns are the eigenvectors from that round of clustering and projection.
dictionary: only returned if verbose is set to TRUE. A data frame that describes which cluster was broken up in each round. The first column is the index of the 'frames' list and the round of clustering, the second column is the cluster that was broken into subclusters in that round, and the third column is D, the number of significant spectral dimensions for that round.
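The cluster label convention described above (top-level cluster first, deeper levels separated by "_") can be unpacked with base R. The following is a small sketch using a hand-made, hypothetical 'clusters' vector rather than real dacgem output; the variable names are illustrative only.

```r
# Hypothetical cluster labels in the documented format, named by individual ID.
clusters <- c(Ind1 = "3_1_2", Ind2 = "3_1_1", Ind3 = "1", Ind4 = "2_4")

# Split each label into its levels.
parts <- strsplit(clusters, "_", fixed = TRUE)

# Top-level cluster is the first component of each label.
top_level <- vapply(parts, function(p) p[1], character(1))

# Depth: 1 = never subdivided, 2 = subcluster, 3 = sub-subcluster, ...
depth <- lengths(parts)

# Number of individuals per top-level cluster.
table(top_level)
```

Grouping by top_level gives the initial ancestry clusters, while depth shows how many rounds of subdivision each individual's cluster went through.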

Example

A worked-out example is provided at the end of this file. Here we provide a sketch of a genotype input file and a bit of R code to show the use of dacgem.

This genotype input file is for the first 5 individuals and 10 SNPs; assume the name of the file is gnt.in.txt:

    [genotype table: one row per individual, with the individual ID (Ind...) in the first column followed by 10 genotype codes]

SNP genotypes are coded as an allele count, i.e., 0 for the 1/1 genotype, 1 for the 1/2 genotype, 2 for the 2/2 genotype, and anything else for all others (in the example, 3 denotes 0/0). The following R code would ready these data for processing:

    gnt = read.table("gnt.in.txt", header = F)
    id = as.matrix(gnt[, 1])
    gnt = as.matrix(gnt[, -1])
    gnt[gnt < 0 | gnt > 2] = NA
    example.out = dacgem(id = id, gnt = gnt)   ### this example is too small to work

clustergem

Usage

    clustergem(gnt, id, pre.clusters = NULL, min.dim = 2, max.dim = 15, verbose = c(TRUE, FALSE))

Arguments

gnt: the genotype matrix. Rows are individuals and columns are SNPs.
id: a vector of unique ID strings for each person in the 'gnt' matrix.
pre.clusters: a vector with information on the clusters to process.
min.dim: the minimum number of dimensions to be considered significant in spectral decompositions.
max.dim: the maximum number of dimensions to be considered significant in spectral decompositions.
verbose: toggles the amount of output, both written to the screen and in the values returned. TRUE is the default.
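The input preparation shown in the dacgem example (an ID column followed by allele counts, with any code outside 0 to 2 set to missing) can be tried end to end on a toy file. The sketch below writes a small hypothetical genotype file to a temporary location, so the file contents, IDs, and number of SNPs are made up for illustration; it stops before calling any GemTools function.

```r
# Hypothetical toy genotype file: ID, then 5 SNP codes (3 stands for missing).
toy_lines <- c("Ind1 0 1 2 3 0",
               "Ind2 1 1 0 2 2",
               "Ind3 2 0 3 1 0")
path <- tempfile(fileext = ".txt")
writeLines(toy_lines, path)

# Read the file; the first column holds the IDs, the rest the genotypes.
raw <- read.table(path, header = FALSE, stringsAsFactors = FALSE)
id  <- as.character(raw[, 1])
gnt <- as.matrix(raw[, -1])
rownames(gnt) <- id

# Anything that is not a valid allele count (0, 1, 2) becomes missing.
gnt[gnt < 0 | gnt > 2] <- NA

# IDs are expected to be unique.
ids_unique <- !anyDuplicated(id)

unlink(path)
```

Checking ids_unique and the number of NA entries before clustering is a cheap way to catch a mis-specified column layout early.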

Details

pre.clusters is usually the output from dacgem. When there is no need to use dacgem to create manageable clusters, pre.clusters does not need to be specified and all the data will be treated as coming from one cluster.

min.dim influences the dimension of the eigenvectors in the elements of frames (the list of data frames with results from each round of clustering). Even if a particular cluster has 0 significant spectral dimensions, there will still be min.dim + 1 dimensions in the results for the branching of that cluster. When calculating the genetic distance between individuals, a minimum of min.dim dimensions is used even though there might be fewer significant dimensions.

Value

clusters: a vector with final cluster labels (strings) for each individual. The names of the vector are the same as 'id'. The top-level clusters are given by the first digit. Clusters that are broken up further are indicated by a "_" symbol.
frames: only returned if verbose is set to TRUE. A list of data frames with detailed results of each level of clustering and subclustering. In each data frame, the rownames are individual ID strings. Column 1 is "cluster", the cluster labeling for that round of clustering. Column 2 is "is.base", which is 1 if the individual was selected to be in the base set for that round of clustering and 0 if the individual was projected into the base eigenspace. The remaining columns are the eigenvectors from that round of clustering and projection.
dictionary: only returned if verbose is set to TRUE. A data frame that describes which cluster was broken up in each round. The first column is the index of the 'frames' list and the round of clustering, the second column is the cluster that was broken into subclusters in that round, and the third column is the number of significant spectral dimensions in the base set for that round.

Example

When the data (id and gnt) are pre-clustered using dacgem and the resulting output of dacgem is named example.out:

    example.cluster = clustergem(gnt = gnt, id = id, pre.clusters = example.out$clusters)

When no pre-clustering is available:

    example.cluster = clustergem(gnt = gnt, id = id)

ccmatchgem

Usage Restriction

ccmatchgem relies on the optmatch package. Because of the usage restrictions of optmatch, ccmatchgem should only be used for academic purposes.

Usage

    ccmatchgem(gnt, id, dx, cdx = NULL, pre.clusters = NULL, min.dim = 2, max.dim = 15, verbose = c(TRUE, FALSE))

Arguments

gnt: the genotype matrix. Rows are individuals and columns are SNPs.
id: a vector of unique ID strings for each person in the 'gnt' matrix.
dx: a vector or matrix with case-control status for the individuals in id.
cdx: a string with the name of the disease information to use.
pre.clusters: a vector with information on the clusters to process.
min.dim: the minimum number of dimensions to be considered significant in spectral decompositions.
max.dim: the maximum number of dimensions to be considered significant in spectral decompositions.
verbose: toggles the amount of output, both written to the screen and in the values returned. TRUE is the default.

Details

pre.clusters is usually the output from dacgem. When there is no need to use dacgem to create manageable clusters, pre.clusters does not need to be specified and all the data will be clustered as one group.

dx can be either a vector or a matrix with disease diagnosis information. Individuals coded 2 are considered to be cases and those coded 1 are controls; all others are considered to have an unknown diagnosis and will not be used for matching. The names attribute (when dx is a vector) or the rownames (when dx is a matrix) will be used to match the diagnosis information to the id.

cdx is only required when more than one diagnosis is specified in dx. cdx will then be matched to the colnames of dx to determine which column of dx to use as the case-control status information.

min.dim influences the dimension of the eigenvectors in the elements of frames (the list of data frames with results from each round of clustering). Even if the base set for a particular cluster has 0 significant spectral dimensions, there will still be min.dim + 1 dimensions in the results for the branching of that cluster.
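The dx and cdx conventions described above (2 = case, 1 = control, anything else unknown; rows matched to id by name; cdx selecting one diagnosis column) can be illustrated in base R. This is only a sketch of the documented coding, not the internal logic of ccmatchgem; the helper dx_role() and the toy diagnosis table are hypothetical.

```r
# Hypothetical helper: look up the diagnosis column 'cdx' for each id and
# translate the documented codes into roles.
dx_role <- function(dx, id, cdx) {
  # Individuals without a row in dx get NA and end up as "unknown".
  status <- dx[match(id, rownames(dx)), cdx]
  role <- ifelse(status %in% 2, "case",
                 ifelse(status %in% 1, "control", "unknown"))
  names(role) <- id
  role
}

# Toy diagnosis table with two diagnoses; rownames carry the individual IDs.
Dx <- data.frame(DX1 = c(2, 2, 2, 1),
                 DX2 = c(2, 0, 2, 1),
                 row.names = c("Ind1", "Ind2", "Ind4", "Ind6"))

# IDs as they might appear in the genotype data; Ind23 has no diagnosis row.
id <- c("Ind1", "Ind2", "Ind4", "Ind6", "Ind23")

role_dx1 <- dx_role(Dx, id, "DX1")
role_dx2 <- dx_role(Dx, id, "DX2")
```

Tabulating the roles (for example with table(role_dx1)) before matching shows how many individuals would be excluded because their diagnosis is unknown.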
When calculating the genetic distance between individuals, a minimum of min.dim dimensions is used even though there might be fewer significant dimensions.

Value

strata: a vector with final case-control strata labels (strings) for each individual. The names of the vector are the same as 'id'.
dx: a vector with the diagnosis status for each individual.
dist: only when verbose = TRUE, a vector with the distance of the individual to its closest genetic neighbor of the opposite diagnosis (case -> closest control, control -> closest case).

closest: only when verbose = TRUE, a vector of ids of the closest neighbors of the opposite diagnosis.

Example

Assume the following information is stored in the diagnosis file example.dx:

           DX1  DX2
    Ind1     2    2
    Ind2     2    0
    Ind4     2    2
    Ind6     1    1
    etc.

Read this information using the following R command:

    Dx = read.table("example.dx", header = T)

When the data (id and gnt) are pre-clustered using dacgem with the resulting output in example.out, the command to match cases to controls for DX1 using ccmatchgem is:

    example.match = ccmatchgem(id = id, gnt = gnt, pre.clusters = example.out$clusters, dx = Dx, cdx = "DX1")

When no pre-clustering is needed and the diagnosis file example2.dx has the following layout:

           DXalt
    Ind1       2
    Ind2       2
    Ind4       0
    Ind6       1
    Ind23      1
    etc.

the commands are:

    Dx = read.table("example2.dx", header = T)
    Example2.match = ccmatchgem(id = id, gnt = gnt, dx = Dx)

plotclusterspdf

Usage

    plotclusterspdf(out, step, root.pdf.file = "anc_cluster")

Arguments

out: data frame in the format produced by dacgem or clustergem using the option verbose = TRUE in those two function calls.
step: a numeric value indicating the frames.index in out$dictionary.
root.pdf.file: the root of the filename to use for the .pdf file. This name will be augmented with the trunk from out$dictionary$trunk[step]. The default is anc_cluster.

Details

The symbols used for plotting are A for cluster 1, B for cluster 2, etc. When more than 26 clusters are formed in one step, a will be used for cluster 27, b for cluster 28, etc.

Ancestry plots created from dacgem output will show the base individuals plotted over the projected ones. In general, the projected individuals will be concentrated in the center of the plots with the base individuals spread out, filling the complete space. This is typical when using projections.

EV.0 is never plotted. This eigenvector represents an overall mean and is used in calculating distances between individuals.

Value

No values are generated by this function.

Example

Assume the following is the information stored in example.out$dictionary from dacgem:

    frames.index  trunk  base.sig.dims
    [table values not recoverable; step 1 corresponds to trunk 0 and step 2 to trunk 3]

To plot the initial ancestry cluster for the full data (trunk = 0) in a pdf file starting with example, issue the following command:

    plotclusterspdf(out = example.out, step = 1, root.pdf.file = "example")

To plot the subclusters in trunk 3, issue the following command:

    plotclusterspdf(out = example.out, step = 2, root.pdf.file = "example")

saveclusterstxt

Usage

    saveclusterstxt(out, step, root.txt.file = "anc_cluster")

Arguments

out: data frame in the format produced by dacgem or clustergem using the option verbose = TRUE in those two function calls.
step: a numeric value matching the values in frames.index in out$dictionary.
root.txt.file: the root of the filename to use for the .txt file. This name will be augmented with the trunk from out$dictionary$trunk[step]. The default is anc_cluster.

Value

No values are generated by this function.

Example

Assume the following information is stored in example.cluster$dictionary from clustergem:

    frames.index  trunk  base.sig.dims
    [table values not recoverable; step 4 corresponds to trunk 3_2]

To save the ancestry information from trunk 3_2 to a txt file starting with example, issue the following command:

    saveclusterstxt(out = example.cluster, step = 4, root.txt.file = "example")

savematchestxt

Usage

    savematchestxt(results, root.txt.file = "matches")

Arguments

results: data frame in the format produced by ccmatchgem.
root.txt.file: the root of the filename to use for the .txt file.

Value

No values are generated by this function.

Example

Write the results from ccmatchgem that were stored in ccmatchgemresults to a file with the name matching.example.txt:

    savematchestxt(results = ccmatchgemresults, root.txt.file = "matching.example")

PRACTICAL NOTES

Computer Requirements

The method has been used successfully with a dataset of ~20,000 individuals and 12,000 selected SNPs. Memory requirements for these data were ~5 GB and it took ~40 minutes to run the function dacgem on our computer (AMD Dual Core Opteron processor running at 2.6 GHz with 32 GB of RAM). When more memory is available, larger datasets can be used. The method is approximately linear in memory requirements and computing time for both the number of individuals and the number of SNPs. Large datasets require a computer with a 64-bit operating system and an adequate amount of RAM (8 GB or more).

Data Quality

When using GemTools it is advisable to use a set of ~5K to ~20K high-quality SNPs. This means a high completion rate (> 99.9%) and a minor allele frequency > 0.01 for each SNP. It is also suggested to choose SNPs that are in low LD with each other (r^2 < 0.01). For individuals, the data should be screened to remove duplicates and close relatives (full sibs, parent-offspring pairs). When these quality checks have not been applied, GEM tends to find spurious dimensions of ancestry as well as many small homogeneous clusters with fewer than 5 individuals.

Typical Results

When starting with a global population, it is typical to find 3 or 4 dimensions of ancestry on the first pass. This will break the global population into roughly African, Asian (East), Asian (South), European, and Latin ancestry. Depending on their sizes, the subclusters will then be broken up into smaller ancestry groups. For the African cluster one typically sees 3 or 4 subgroups, a North-South and an East-West cline can typically be found for the Europeans, etc. Keep in mind that dacgem will keep dividing clusters until all the clusters have fewer than max.ind.cluster individuals in them.
Some of the later splits might just be splits to satisfy that requirement, even though there is no real reason to split as far as ancestry is concerned.

EXTENDED HGDP EXAMPLE

Genomic DNA samples from 1,043 individuals from around the world were collected by the Human Genome Diversity Project (HGDP), in a collaboration with the Centre d'Etude du Polymorphisme Humain (CEPH) in Paris. They represent 51 different populations from Africa, Europe, the Middle East, South and Central Asia, East Asia, Oceania and the Americas. For details on the individuals in this collection, see H. Cann et al. Science 296: (2002) and its Supplemental Data; Rosenberg et al. Science 298: (2002); and Rosenberg et al. PLoS Genetics 1: (2005).

In this example we focus on individuals from two continents (Africa and Europe), with 4 and 7 tribes representing each continent, respectively. The African tribes are Biaka Pygmies (102), Mandenka (103), Mbuti Pygmies (104), and Yoruba (106). The tribes representing Europe are Adygei (538), French (539), French Basques (540), Italian (541), Orcadian (542), Russian (543), and Sardinian (544). The numbers in brackets are the last three digits of the id that is used in the example, i.e., HGDP123456_103 is an individual from the Mandenka tribe.

In the file HGDP_example.R we provide a worked example of these data, which are stored in the gzipped file HGDP.sub.gnt.gz; the example fully utilizes GemTools. The example R code includes extensive comments. In addition to the analysis of the population structure, three approaches are provided that show how to use the output from GemTools to control for structure in an analysis of association between genotype and phenotype.
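Because the population code is embedded in the last three digits of each id, the tribe and continent of an individual can be recovered directly from the id vector, which is handy when checking clusters against known ancestry. The sketch below uses the codes listed above; the second id and all variable names are hypothetical, and the code is not taken from HGDP_example.R.

```r
# Population codes as listed in the text.
pop_codes <- c("102" = "Biaka Pygmies", "103" = "Mandenka",
               "104" = "Mbuti Pygmies", "106" = "Yoruba",
               "538" = "Adygei",        "539" = "French",
               "540" = "French Basques", "541" = "Italian",
               "542" = "Orcadian",      "543" = "Russian",
               "544" = "Sardinian")
african_codes <- c("102", "103", "104", "106")

# Example ids; the first is the one used in the text, the second is made up.
id <- c("HGDP123456_103", "HGDP000001_539")

# The population code is the last three characters of the id.
code <- substr(id, nchar(id) - 2, nchar(id))

population <- unname(pop_codes[code])
continent  <- ifelse(code %in% african_codes, "Africa", "Europe")
```

A cross-tabulation such as table(population, clusters) could then be used to compare the known populations with cluster labels, assuming a 'clusters' vector in the same order as id.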
