GemTools Documentation
Bert Klei and Brian P. Kent
February 2011

Literature:

This software is described in:
GemTools: a fast and efficient approach to estimating genetic ancestry (in preparation). Klei L, Kent BP, Melhem N, Devlin B, Roeder K.

The GemTools functions are primarily based on the methods described in:
Discovering genetic ancestry using spectral graph theory. Genet Epidemiol 2010 Jan;34(1):51-9. Lee AB, Luca D, Klei L, Devlin B, Roeder K.

The projection methods are described in:
Using ancestry matching to combine family-based and unrelated samples for genome-wide association studies. Stat Med 2010 Dec 10;29(28). Crossett A, Kent BP, Klei L, Ringquist S, Trucco M, Roeder K, Devlin B.

GEM uses the spectral graph methods described in Lee et al. (2010) to find a low-dimensional representation of the genetic similarities between individuals, referred to as an eigenmap. A key feature of the eigenmap is D, the number of eigenvectors required to represent the variability in the data. For instance, separating 3 major ancestry groups usually requires D=2 dimensions, and D=1 models a cline. D=0 suggests that the sample is genetically homogeneous. D is determined using a test of significance (Lee et al. 2010). Assuming an eigenmap is constructed from a representative base sample, additional individuals can be projected onto the map using the Nystrom approximation (Crossett et al. 2010).

Description

GemTools is a package of functions that helps the user account for the genetic ancestry of a large number of individuals using spectral graph theory. The package has three components:
1) dacgem
This function organizes a large number of individuals into smaller clusters of individuals with similar genetic ancestry. The approach samples a representative base sample to create an eigenmap of the genotype information. The remaining non-base individuals are then projected into that eigenmap using a Nystrom projection. Working from the base sample, clusters are formed. Non-base individuals are assigned to the cluster of their genetically closest base neighbor.

2) clustergem
After the population is divided into clusters of manageable size, this function further subdivides the dacgem clusters until each subcluster is genetically homogeneous (D=0). For relatively small data sets (<2000 individuals) this function can be used on the original data to generate traditional eigenvectors to account for genetic ancestry.

3) ccmatchgem
This function finds the best matches among cases and controls based on ancestry within the clusters generated by dacgem. Matches are determined with the function fullmatch from the R library optmatch. fullmatch creates strata that include 1 case matched to 1 or more controls, or 1 control matched to 1 or more cases. This function can be used on the subdivided data, or on the complete data if it is relatively small. Results of this function can be used as strata in conditional logistic regression or in other genetic analyses.

In addition to the three main functions, three further functions help the user plot (plotclusterspdf) and save (saveclusterstxt) the results from dacgem and clustergem, and save (savematchestxt) the results from ccmatchgem.

Part of the GEM package uses the library optmatch. This library should be downloaded from the R repository and loaded (library("optmatch")) before the function ccmatchgem is used.
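The assignment of non-base individuals to the cluster of their closest base neighbor, described for dacgem above, can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the coordinates and labels are made up, the helper assign.to.nearest.base is hypothetical and not part of GemTools, and plain Euclidean distance on the eigenmap coordinates is assumed rather than taken from the package source.

```r
# Made-up eigenmap coordinates for 4 base individuals (2 dimensions)
base.coords <- matrix(c(0, 0,
                        0, 1,
                        5, 5,
                        6, 5),
                      ncol = 2, byrow = TRUE,
                      dimnames = list(c("B1", "B2", "B3", "B4"),
                                      c("EV.1", "EV.2")))
# Made-up cluster labels already formed on the base sample
base.cluster <- c(B1 = "1", B2 = "1", B3 = "2", B4 = "2")

# Made-up coordinates of 3 projected (non-base) individuals
proj.coords <- matrix(c(0.2, 0.4,
                        5.2, 4.8,
                        1.0, 1.2),
                      ncol = 2, byrow = TRUE,
                      dimnames = list(c("P1", "P2", "P3"),
                                      c("EV.1", "EV.2")))

# Hypothetical helper: each projected individual inherits the label of the
# base individual at the smallest Euclidean distance (an assumed metric).
assign.to.nearest.base <- function(proj, base, base.labels) {
  nearest <- apply(proj, 1, function(p) {
    d <- sqrt(rowSums(sweep(base, 2, p)^2))
    names(which.min(d))
  })
  labels <- base.labels[nearest]
  names(labels) <- rownames(proj)
  labels
}

proj.cluster <- assign.to.nearest.base(proj.coords, base.coords, base.cluster)
print(proj.cluster)
```

In this toy setting P1 and P3 fall next to base individuals of cluster 1 and P2 next to cluster 2, so each inherits the label of its neighbor.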
dacgem

Usage

    dacgem(gnt, id, n.ind.base = 500, max.ind.cluster = 1000, min.dim = 2, max.dim = 15, method = c("homogeneous", "quick"), verbose = c(TRUE, FALSE))

Arguments

gnt: the genotype matrix. Rows are individuals and columns are SNPs.
id: a vector of unique ID strings for each person in the 'gnt' matrix.
n.ind.base: the number of individuals who are chosen at random to be in the base set. This is also the number used in the base for all levels of subclustering.
max.ind.cluster: the maximum number of individuals allowed in each final cluster. If a cluster still contains more than max.ind.cluster individuals after the initial clustering, it is broken up into subclusters until each subcluster is small enough.
min.dim: the minimum number of dimensions to be utilized in spectral decompositions.
max.dim: the maximum number of dimensions to be considered significant in spectral decompositions.
method: the desired method of clustering. The "homogeneous" method is the default and creates clusters that have no significant spectral dimensions for the current (sub) data set. The "quick" method creates a number of clusters equal to the number of significant dimensions plus one in the spectral decomposition of the base set.
verbose: toggles the amount of output, both written to the screen and in the values returned. TRUE is the default.

Details

min.dim influences the dimension of the eigenvectors in the elements of frames (the list of data frames with results from each round of clustering). Even if the base set for a particular cluster has D=0, min.dim dimensions will be used in the eigenmap of that cluster. When calculating the genetic distance between individuals, max(D, min.dim) dimensions are used.

The chosen method of clustering is used consistently at each level of the algorithm. When the homogeneous clustering method is applied to a cluster that already has D=0, the algorithm continues to produce subclusters until the number of individuals in each cluster is less than max.ind.cluster. The quick method of clustering splits each cluster into max(D + 1, 2) subclusters. When D=0, it splits the cluster into the minimum number of groups necessary for each group to have fewer than max.ind.cluster members.

The quick method is recommended when the homogeneous method creates many small subsets even though there were few significant dimensions.
One reason for many small subsets is the presence of family members in the data, in particular twins, full sibs, and parent-offspring pairs. These should be removed before clustering.

Value

clusters: a vector with final cluster labels (strings) for each individual. The names of the vector are the same as 'id'. The top-level clusters are given by the first digit. Clusters that are broken up further are indicated by a "_" symbol, i.e., 3_1_2 indicates main cluster 3, subcluster 1, and sub-subcluster 2.
frames: only returned if verbose is set to TRUE. A list of data frames with detailed results of each level of clustering and subclustering. In each data frame, the rownames are individual ID strings. Column 1 is "cluster", the cluster labeling for that round of clustering. Column 2 is "is.base", which is 1 if the individual was selected to be in the base set for that round of clustering and 0 if the individual was projected into the eigenmap. The remaining columns are the eigenvectors from that round of clustering and projection.
dictionary: only returned if verbose is set to TRUE. A data frame that describes which cluster was broken up in each round. The first column is the index of the 'frames' list and the round of clustering, the second column is the cluster that was broken into subclusters in that round, and the third column is D, the number of significant spectral dimensions for that round.
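Because the cluster labels encode the hierarchy with "_" separators, they can be taken apart with base R string functions. The following is a minimal sketch on a made-up label vector in the format described above; the individual names and labels are illustrative and not output of dacgem.

```r
# Made-up cluster labels in the documented "main_sub_subsub" format
clusters <- c(Ind1 = "3_1_2", Ind2 = "3_1_1", Ind3 = "1", Ind4 = "2_4")

# Split each label into its levels
parts <- strsplit(clusters, "_", fixed = TRUE)

# Top-level cluster and number of levels for each individual
top.level <- vapply(parts, function(p) p[1], character(1))
depth <- lengths(parts)

# Parent label: drop the last level; top-level clusters have no parent
parent <- vapply(parts, function(p) {
  if (length(p) > 1) paste(head(p, -1), collapse = "_") else NA_character_
}, character(1))

# Number of individuals per top-level cluster
print(table(top.level))
```

Grouping on top.level or parent gives a quick way to tabulate how many individuals ended up under each branch.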
Example

A worked example is provided at the end of this file. Here we provide a sketch of a genotype input file and a bit of R code to show the use of dacgem.

This genotype input file is for the first 5 individuals and 10 SNPs; assume the name of the file is gnt.in.txt:

    [genotype table: one row per individual, with the ID in the first column followed by the 10 SNP genotype codes; the values are not recoverable in this copy]
    etc.

SNP genotypes are coded as an allele count: i.e., 0 for the 1/1 genotype, 1 for the 1/2 genotype, 2 for the 2/2 genotype, and anything else for all others (in the example, 3 denotes 0/0).

The following R code would ready these data for processing:

    gnt = read.table("gnt.in.txt", header = FALSE)
    id = as.matrix(gnt[, 1])
    gnt = as.matrix(gnt[, -1])
    gnt[gnt < 0 | gnt > 2] = NA
    example.out = dacgem(id = id, gnt = gnt)   ### this example is too small to work

clustergem

Usage

    clustergem(gnt, id, pre.clusters = NULL, min.dim = 2, max.dim = 15, verbose = c(TRUE, FALSE))

Arguments

gnt: the genotype matrix. Rows are individuals and columns are SNPs.
id: a vector of unique ID strings for each person in the 'gnt' matrix.
pre.clusters: a vector with information on the clusters to process.
min.dim: the minimum number of dimensions to be considered significant in spectral decompositions.
max.dim: the maximum number of dimensions to be considered significant in spectral decompositions.
verbose: toggles the amount of output, both written to the screen and in the values returned. TRUE is the default.
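The gnt and id arguments listed above follow the allele-count coding from the dacgem example. As a self-contained illustration, the same preparation steps can be run on a small in-memory table. The three rows below are made up for this sketch and are far too few for any GemTools function; they only show the recoding of out-of-range codes to NA and two simple checks that are not part of the original manual.

```r
# Made-up genotype rows (3 individuals, 4 SNPs); 3 stands for a missing call
raw.text <- "Ind1 0 1 2 3
Ind2 1 1 0 2
Ind3 2 3 0 1"

raw <- read.table(text = raw.text, header = FALSE, stringsAsFactors = FALSE)

# First column holds the IDs, the remaining columns the allele counts
id <- as.character(raw[, 1])
gnt <- as.matrix(raw[, -1])

# Anything outside 0, 1, 2 is treated as missing
gnt[gnt < 0 | gnt > 2] <- NA

# Simple checks (added for illustration): unique IDs, missing rate per SNP
stopifnot(!anyDuplicated(id))
print(colMeans(is.na(gnt)))
```

The per-SNP missing rate printed at the end is one way to screen for the high completion rate recommended under Data Quality.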
Details

pre.clusters is usually the output from dacgem. When there is no need to use dacgem to create manageable clusters, pre.clusters does not need to be specified and all the data will be treated as coming from one cluster.

min.dim influences the dimension of the eigenvectors in the elements of frames (the list of data frames with results from each round of clustering). Even if a particular cluster has 0 significant spectral dimensions, there will still be min.dim + 1 dimensions in the results for the branching of that cluster. When calculating the genetic distance between individuals, at least min.dim dimensions are used even though there might be fewer significant dimensions.

Value

clusters: a vector with final cluster labels (strings) for each individual. The names of the vector are the same as 'id'. The top-level clusters are given by the first digit. Clusters that are broken up further are indicated by a "_" symbol.
frames: only returned if verbose is set to TRUE. A list of data frames with detailed results of each level of clustering and subclustering. In each data frame, the rownames are individual ID strings. Column 1 is "cluster", the cluster labeling for that round of clustering. Column 2 is "is.base", which is 1 if the individual was selected to be in the base set for that round of clustering and 0 if the individual was projected into the base eigenspace. The remaining columns are the eigenvectors from that round of clustering and projection.
dictionary: only returned if verbose is set to TRUE. A data frame that describes which cluster was broken up in each round. The first column is the index of the 'frames' list and the round of clustering, the second column is the cluster that was broken into subclusters in that round, and the third column is the number of significant spectral dimensions in the base set for that round.
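The role of min.dim in the distance calculation can be illustrated with a small numerical sketch. The eigenvector values below are made up, D is set by hand, and Euclidean distance over the first max(D, min.dim) eigenvector columns is an assumption of this sketch; the exact distance used inside GemTools, including how EV.0 enters, is not reproduced here.

```r
# Made-up eigenvector coordinates for 3 individuals (columns EV.1 to EV.4)
ev <- matrix(c( 0.10, 0.30, -0.20, 0.05,
                0.12, 0.28,  0.40, 0.00,
               -0.50, 0.10,  0.00, 0.20),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("Ind1", "Ind2", "Ind3"), paste0("EV.", 1:4)))

D <- 0         # hypothetical number of significant dimensions
min.dim <- 2   # default value from the Usage section

# At least min.dim dimensions are used, even when fewer are significant
n.dim <- max(D, min.dim)

# Assumed metric: Euclidean distance on the first n.dim columns
gen.dist <- as.matrix(dist(ev[, seq_len(n.dim), drop = FALSE]))
print(round(gen.dist, 3))
```

With D = 0 the first two columns are still used, so Ind1 and Ind2 come out close to each other even though they differ strongly in EV.3.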
Example

When the data (id and gnt) are pre-clustered using dacgem and the resulting output of dacgem is named example.out:

    example.cluster = clustergem(gnt = gnt, id = id, pre.clusters = example.out$clusters)

When no pre-clustering is available:

    example.cluster = clustergem(gnt = gnt, id = id)

ccmatchgem

Usage Restriction

ccmatchgem relies on the optmatch package. Because of the usage restrictions of optmatch, ccmatchgem should only be used for academic purposes.
Usage

    ccmatchgem(gnt, id, dx, cdx = NULL, pre.clusters = NULL, min.dim = 2, max.dim = 15, verbose = c(TRUE, FALSE))

Arguments

gnt: the genotype matrix. Rows are individuals and columns are SNPs.
id: a vector of unique ID strings for each person in the 'gnt' matrix.
dx: a vector or matrix with case-control status for the individuals in id.
cdx: a string with the name of the disease information to use.
pre.clusters: a vector with information on the clusters to process.
min.dim: the minimum number of dimensions to be considered significant in spectral decompositions.
max.dim: the maximum number of dimensions to be considered significant in spectral decompositions.
verbose: toggles the amount of output, both written to the screen and in the values returned. TRUE is the default.

Details

pre.clusters is usually the output from dacgem. When there is no need to use dacgem to create manageable clusters, pre.clusters does not need to be specified and all the data will be clustered as one group.

dx can be either a vector or a matrix with disease diagnosis information. Individuals coded 2 are considered to be cases, those coded 1 are controls, and all others are considered to have an unknown diagnosis and will not be used for matching. The names attribute (when dx is a vector) or the rownames (when dx is a matrix) are used to match the diagnosis information to the id.

cdx is only required when more than one diagnosis is specified in dx. cdx is then matched to the colnames of dx to determine which column of dx to use as the case-control status information.

min.dim influences the dimension of the eigenvectors in the elements of frames (the list of data frames with results from each round of clustering). Even if the base set for a particular cluster has 0 significant spectral dimensions, there will still be min.dim + 1 dimensions in the results for the branching of that cluster.
When calculating the genetic distance between individuals, at least min.dim dimensions are used even though there might be fewer significant dimensions.

Value

strata: a vector with final case-control strata labels (strings) for each individual. The names of the vector are the same as 'id'.
dx: a vector with the diagnosis status for each individual.
dist: only when verbose = TRUE, a vector with the distance of the individual to its closest genetic neighbor of the opposite diagnosis (case to closest control, control to closest case).
closest: only when verbose = TRUE, a vector of ids of the closest neighbors of the opposite diagnosis.

Example

Assume the following information is stored in the diagnosis file example.dx:

         DX1 DX2
    Ind1   2   2
    Ind2   2   0
    Ind4   2   2
    Ind6   1   1
    etc.

Read this information using the following R command:

    Dx = read.table("example.dx", header = TRUE)

When the data (id and gnt) are pre-clustered using dacgem with the resulting output in example.out, the command to match cases to controls for DX1 using ccmatchgem is:

    example.match = ccmatchgem(id = id, gnt = gnt, pre.clusters = example.out$clusters, dx = Dx, cdx = "DX1")

When no pre-clustering is needed and the diagnosis file example2.dx has the following layout:

          DXalt
    Ind1      2
    Ind2      2
    Ind4      0
    Ind6      1
    Ind23     1
    etc.

    Dx = read.table("example2.dx", header = TRUE)
    Example2.match = ccmatchgem(id = id, gnt = gnt, dx = Dx)

plotclusterspdf

Usage

    plotclusterspdf(out, step, root.pdf.file = "anc_cluster")
Arguments

out: Data frame in the format produced by dacgem or clustergem using the option verbose = TRUE in those two function calls.
step: a numeric indicating the frames.index in out$dictionary.
root.pdf.file: the root of the filename to use for the .pdf file. This name will be augmented with trunk from out$dictionary$trunk[step]. Default is "anc_cluster".

Details

The symbols used for plotting are A for cluster 1, B for cluster 2, etc. When more than 26 clusters are formed in one step, lowercase a is used for cluster 27, b for cluster 28, etc.

Ancestry plots created from dacgem output show the base individuals plotted over the projected ones. In general the projected individuals are concentrated in the center of the plots, with the base individuals spread out and filling the complete space. This is typical when using projections.

EV.0 is never plotted. This eigenvector represents an overall mean and is used in calculating distances between individuals.

No values are generated by this function.

Example

Assume the following is the information stored in example.out$dictionary from dacgem:

    frames.index trunk base.sig.dims
    [table values not recoverable in this copy]

To plot the initial ancestry cluster for the full data (trunk = 0) in a pdf file starting with example, issue the following command:

    plotclusterspdf(out = example.out, step = 1, root.pdf.file = "example")

To plot the subclusters in trunk 3, issue the following command:

    plotclusterspdf(out = example.out, step = 2, root.pdf.file = "example")

saveclusterstxt

Usage

    saveclusterstxt(out, step, root.txt.file = "anc_cluster")

Arguments
out: Data frame in the format produced by dacgem or clustergem using the option verbose = TRUE in those two function calls.
step: a numeric matching the values in frames.index in out$dictionary.
root.txt.file: the root of the filename to use for the .txt file. This name will be augmented with trunk from out$dictionary$trunk[step]. Default is "anc_cluster".

Details

No values are generated by this function.

Example

Assume the following information is stored in example.cluster$dictionary from clustergem:

    frames.index trunk base.sig.dims
    [table values not recoverable in this copy]

To save the ancestry information from trunk 3_2 to a txt file starting with example, issue the following command:

    saveclusterstxt(out = example.cluster, step = 4, root.txt.file = "example")

savematchestxt

Usage

    savematchestxt(results, root.txt.file = "matches")

Arguments

results: Data frame in the format produced by ccmatchgem.
root.txt.file: the root of the filename to use for the .txt file.

Details

No values are generated by this function.
Example

Write the results from ccmatchgem that were stored in ccmatchgemresults to a file with the name matching.example.txt:

    savematchestxt(results = ccmatchgemresults, root.txt.file = "matching.example")

PRACTICAL NOTES

Computer Requirements

The method has been used successfully with a dataset of ~20,000 individuals and 12,000 selected SNPs. Memory requirements for these data were ~5 GB and it took ~40 minutes to run the function dacgem on our computer (AMD Dual Core Opteron processor running at 2.6 GHz with 32 GB of RAM). When more memory is available, larger datasets can be used. The method is approximately linear in memory requirements and computing time in both the number of individuals and the number of SNPs. Large datasets require a computer with a 64-bit operating system and an adequate amount of RAM (8 GB or more).

Data Quality

When using GemTools it is advisable to use a set of ~5K to ~20K high-quality SNPs. This means a high completion rate (> 99.9%) and a minor allele frequency > 0.01 for each SNP. It is also suggested to choose SNPs that are in low LD with each other (r^2 < 0.01). For individuals, the data should be screened to remove duplicates and close relatives (full sibs, parent-offspring pairs). When these quality checks have not been applied, GEM tends to find spurious dimensions of ancestry as well as many small homogeneous clusters with fewer than 5 individuals.

Typical Results

When starting with a global population it is typical to find 3 or 4 dimensions of ancestry on the first pass. This breaks the global population into roughly African, Asian (East), Asian (South), European, and Latin ancestry. Depending on the sizes of the subclusters, they are then broken up into smaller ancestry groups. For Africans one typically sees 3 or 4 subgroups; for Europeans a North-South and an East-West cline can typically be found, etc. Keep in mind that dacgem will keep dividing clusters until all the clusters have fewer than max.ind.cluster individuals in them.
Some of the later splits might be made only to satisfy that requirement, even though there is no real reason to split as far as ancestry is concerned.

EXTENDED HGDP EXAMPLE
Genomic DNA samples from 1,043 individuals from around the world were collected by the Human Genome Diversity Project (HGDP), in a collaboration with the Centre d'Étude du Polymorphisme Humain (CEPH) in Paris. They represent 51 different populations from Africa, Europe, the Middle East, South and Central Asia, East Asia, Oceania, and the Americas. For details on the individuals in this collection, see H. Cann et al., Science 296 (2002) and its Supplemental Data; Rosenberg et al., Science 298 (2002); and Rosenberg et al., PLoS Genetics 1 (2005).

In this example we focus on individuals from two continents (Africa and Europe), with 4 and 7 tribes representing each continent, respectively. The African tribes are Biaka Pygmies (102), Mandenka (103), Mbuti Pygmies (104), and Yoruba (106). Tribes representing Europe are Adygei (538), French (539), French Basques (540), Italian (541), Orcadian (542), Russian (543), and Sardinian (544). The numbers in parentheses are the last three digits of the id that is used in the example, i.e., HGDP123456_103 is an individual from the Mandenka tribe.

In the file HGDP_example.R we provide a worked example of these data, stored in the gzipped file HGDP.sub.gnt.gz, which fully utilizes GemTools. The example R code includes extensive comments. In addition to the analysis of the population structure, three approaches are provided that show how to use the output from GemTools to control for structure in an analysis of association between genotype and phenotype.
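Since the population code is carried in the last three digits of each id, it can be recovered with a short lookup. This is a sketch only: the lookup table is built from the codes listed above, while the two ids other than HGDP123456_103 are made up to illustrate the format and are not taken from the HGDP data.

```r
# Population codes as listed in the text above
pop.codes <- c("102" = "Biaka Pygmies", "103" = "Mandenka",
               "104" = "Mbuti Pygmies", "106" = "Yoruba",
               "538" = "Adygei", "539" = "French",
               "540" = "French Basques", "541" = "Italian",
               "542" = "Orcadian", "543" = "Russian",
               "544" = "Sardinian")
african.codes <- c("102", "103", "104", "106")

# One id from the text plus two made-up ids in the same format
id <- c("HGDP123456_103", "HGDP000001_540", "HGDP000002_106")

# Strip everything up to the last underscore to obtain the code
code <- sub("^.*_", "", id)
population <- unname(pop.codes[code])
continent <- ifelse(code %in% african.codes, "Africa", "Europe")

print(data.frame(id, code, population, continent))
```

Labels derived this way are convenient for checking how well the clusters returned by dacgem or clustergem line up with the sampled populations.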
More informationPRSice: Polygenic Risk Score software - Vignette
PRSice: Polygenic Risk Score software - Vignette Jack Euesden, Paul O Reilly March 22, 2016 1 The Polygenic Risk Score process PRSice ( precise ) implements a pipeline that has become standard in Polygenic
More informationPackage SimGbyE. July 20, 2009
Package SimGbyE July 20, 2009 Type Package Title Simulated case/control or survival data sets with genetic and environmental interactions. Author Melanie Wilson Maintainer Melanie
More informationClick on "+" button Select your VCF data files (see #Input Formats->1 above) Remove file from files list:
CircosVCF: CircosVCF is a web based visualization tool of genome-wide variant data described in VCF files using circos plots. The provided visualization capabilities, gives a broad overview of the genomic
More informationFVGWAS- 3.0 Manual. 1. Schematic overview of FVGWAS
FVGWAS- 3.0 Manual Hongtu Zhu @ UNC BIAS Chao Huang @ UNC BIAS Nov 8, 2015 More and more large- scale imaging genetic studies are being widely conducted to collect a rich set of imaging, genetic, and clinical
More informationPackage asymld. August 29, 2016
Package asymld August 29, 2016 Type Package Title Asymmetric Linkage Disequilibrium (ALD) for Polymorphic Genetic Data Version 0.1 Date 2016-01-29 Author Richard M. Single Maintainer Richard M. Single
More informationRLMM - Robust Linear Model with Mahalanobis Distance Classifier
RLMM - Robust Linear Model with Mahalanobis Distance Classifier Nusrat Rabbee and Gary Wong June 13, 2018 Contents 1 Introduction 1 2 Instructions for Genotyping Affymetrix Mapping 100K array - Xba set
More informationApplication of Spectral Clustering Algorithm
1/27 Application of Spectral Clustering Algorithm Danielle Middlebrooks dmiddle1@math.umd.edu Advisor: Kasso Okoudjou kasso@umd.edu Department of Mathematics University of Maryland- College Park Advance
More informationPolymorphism and Variant Analysis Lab
Polymorphism and Variant Analysis Lab Arian Avalos PowerPoint by Casey Hanson Polymorphism and Variant Analysis Matt Hudson 2018 1 Exercise In this exercise, we will do the following:. 1. Gain familiarity
More informationAssociation Analysis of Sequence Data using PLINK/SEQ (PSEQ)
Association Analysis of Sequence Data using PLINK/SEQ (PSEQ) Copyright (c) 2018 Stanley Hooker, Biao Li, Di Zhang and Suzanne M. Leal Purpose PLINK/SEQ (PSEQ) is an open-source C/C++ library for working
More informationQUICKTEST user guide
QUICKTEST user guide Toby Johnson Zoltán Kutalik December 11, 2008 for quicktest version 0.94 Copyright c 2008 Toby Johnson and Zoltán Kutalik Permission is granted to copy, distribute and/or modify this
More informationGlobal modelling of air pollution using multiple data sources
Global modelling of air pollution using multiple data sources Matthew Thomas M.L.Thomas@bath.ac.uk Supervised by Dr. Gavin Shaddick In collaboration with IHME and WHO June 14, 2016 1/ 1 MOTIVATION Air
More informationCircosVCF workshop, TAU, 9/11/2017
CircosVCF exercise In this exercise, we will create and design circos plots using CircosVCF. We will use vcf files of a published case "X-linked elliptocytosis with impaired growth is related to mutated
More informationUpdates and Case Study
Archipelago Measurement Infrastructure Updates and Case Study Young Hyun CAIDA ISMA 2010 AIMS Workshop Feb 9, 2010 2 Outline Introduction Monitor Deployment Measurements & Collaborations Tools Development
More informationLEA: An R Package for Landscape and Ecological Association Studies
LEA: An R Package for Landscape and Ecological Association Studies Eric Frichot and Olivier François Université Grenoble-Alpes, Centre National de la Recherche Scientifique, TIMC-IMAG UMR 5525, Grenoble,
More informationHidden Markov Models in the context of genetic analysis
Hidden Markov Models in the context of genetic analysis Vincent Plagnol UCL Genetics Institute November 22, 2012 Outline 1 Introduction 2 Two basic problems Forward/backward Baum-Welch algorithm Viterbi
More informationJMP Clinical. Release Notes. Version 5.0
JMP Clinical Version 5.0 Release Notes Creativity involves breaking out of established patterns in order to look at things in a different way. Edward de Bono JMP, A Business Unit of SAS SAS Campus Drive
More informationMaximizing Public Data Sources for Sequencing and GWAS
Maximizing Public Data Sources for Sequencing and GWAS February 4, 2014 G Bryce Christensen Director of Services Questions during the presentation Use the Questions pane in your GoToWebinar window Agenda
More informationDeltaGen: Quick start manual
1 DeltaGen: Quick start manual Dr. Zulfi Jahufer & Dr. Dongwen Luo CONTENTS Page Main operations tab commands 2 Uploading a data file 3 Matching variable identifiers 4 Data check 5 Univariate analysis
More informationTo finish the current project and start a new project. File Open a text data
GGEbiplot version 5 In addition to being the most complete, most powerful, and most user-friendly software package for biplot analysis, GGEbiplot also has powerful components for on-the-fly data manipulation,
More informationCHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM
96 CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM Clustering is the process of combining a set of relevant information in the same group. In this process KM algorithm plays
More information11/17/2009 Comp 590/Comp Fall
Lecture 20: Clustering and Evolution Study Chapter 10.4 10.8 Problem Set #5 will be available tonight 11/17/2009 Comp 590/Comp 790-90 Fall 2009 1 Clique Graphs A clique is a graph with every vertex connected
More informationPackage detectruns. February 6, 2018
Type Package Package detectruns February 6, 2018 Title Detect Runs of Homozygosity and Runs of Heterozygosity in Diploid Genomes Version 0.9.5 Date 2018-02-05 Detection of runs of homozygosity and of heterozygosity
More informationPackage inversion. R topics documented: July 18, Type Package. Title Inversions in genotype data. Version
Package inversion July 18, 2013 Type Package Title Inversions in genotype data Version 1.8.0 Date 2011-05-12 Author Alejandro Caceres Maintainer Package to find genetic inversions in genotype (SNP array)
More informationGenViewer Tutorial / Manual
GenViewer Tutorial / Manual Table of Contents Importing Data Files... 2 Configuration File... 2 Primary Data... 4 Primary Data Format:... 4 Connectivity Data... 5 Module Declaration File Format... 5 Module
More informationThe Economist rate card 2017 (GBP)
The Economist rate card 2017 (GBP) The Economist newspaper, Digital Editions app, Snapchat, and Global Business Review The Economist allows you to reach our influential audience through print and our award
More informationThe LDheatmap Package
The LDheatmap Package May 6, 2006 Title Graphical display of pairwise linkage disequilibria between SNPs Version 0.2-1 Author Ji-Hyung Shin , Sigal Blay , Nicholas Lewin-Koh
More informationMACAU User Manual. Xiang Zhou. March 15, 2017
MACAU User Manual Xiang Zhou March 15, 2017 Contents 1 Introduction 2 1.1 What is MACAU...................................... 2 1.2 How to Cite MACAU................................... 2 1.3 The Model.........................................
More informationPredicting Popular Xbox games based on Search Queries of Users
1 Predicting Popular Xbox games based on Search Queries of Users Chinmoy Mandayam and Saahil Shenoy I. INTRODUCTION This project is based on a completed Kaggle competition. Our goal is to predict which
More informationPackage RPMM. August 10, 2010
Package RPMM August 10, 2010 Type Package Title Recursively Partitioned Mixture Model Version 1.06 Date 2009-11-16 Author E. Andres Houseman, Sc.D. Maintainer E. Andres Houseman
More informationPackage LEA. December 24, 2017
Package LEA December 24, 2017 Title LEA: an R package for Landscape and Ecological Association Studies Version 2.0.0 Date 2017-04-03 Author , Olivier Francois
More informationThe Economist rate card 2017 (USD)
The Economist rate card 2017 (USD) The Economist newspaper, Digital Editions app, Snapchat, and Global Business Review The Economist allows you to reach our influential audience through print and our award
More informationPackage kofnga. November 24, 2015
Type Package Package kofnga November 24, 2015 Title A Genetic Algorithm for Fixed-Size Subset Selection Version 1.2 Date 2015-11-24 Author Mark A. Wolters Maintainer Mark A. Wolters
More information500K Data Analysis Workflow using BRLMM
500K Data Analysis Workflow using BRLMM I. INTRODUCTION TO BRLMM ANALYSIS TOOL... 2 II. INSTALLATION AND SET-UP... 2 III. HARDWARE REQUIREMENTS... 3 IV. BRLMM ANALYSIS TOOL WORKFLOW... 3 V. RESULTS/OUTPUT
More informationbimm vignette Matti Pirinen & Christian Benner University of Helsinki November 15, 2016
bimm vignette Matti Pirinen & Christian Benner University of Helsinki November 15, 2016 1 Introduction bimm is a software package to efficiently estimate variance parameters of a bivariate lineax mixed
More informationPackage ibbig. R topics documented: December 24, 2018
Type Package Title Iterative Binary Biclustering of Genesets Version 1.26.0 Date 2011-11-23 Author Daniel Gusenleitner, Aedin Culhane Package ibbig December 24, 2018 Maintainer Aedin Culhane
More informationGenetic Programming. Charles Chilaka. Department of Computational Science Memorial University of Newfoundland
Genetic Programming Charles Chilaka Department of Computational Science Memorial University of Newfoundland Class Project for Bio 4241 March 27, 2014 Charles Chilaka (MUN) Genetic algorithms and programming
More informationarxiv: v2 [q-bio.qm] 17 Nov 2013
arxiv:1308.2150v2 [q-bio.qm] 17 Nov 2013 GeneZip: A software package for storage-efficient processing of genotype data Palmer, Cameron 1 and Pe er, Itsik 1 1 Center for Computational Biology and Bioinformatics,
More informationDrug versus Disease (DrugVsDisease) package
1 Introduction Drug versus Disease (DrugVsDisease) package The Drug versus Disease (DrugVsDisease) package provides a pipeline for the comparison of drug and disease gene expression profiles where negatively
More informationSTAT 540 Computing in Statistics
STAT 540 Computing in Statistics Introduces programming skills in two important statistical computer languages/packages. 30-40% R and 60-70% SAS Examples of Programming Skills: 1. Importing Data from External
More informationMidterm I Exam Principles of Imperative Computation André Platzer Ananda Gunawardena. February 23, Name: Andrew ID: Section:
Midterm I Exam 15-122 Principles of Imperative Computation André Platzer Ananda Gunawardena February 23, 2012 Name: Andrew ID: Section: Instructions This exam is closed-book with one sheet of notes permitted.
More informationPackage MOJOV. R topics documented: February 19, 2015
Type Package Title Mojo Variants: Rare Variants analysis Version 1.0.1 Date 2013-02-25 Author Maintainer Package MOJOV February 19, 2015 A package for analysis between rare variants
More informationPackage PedCNV. February 19, 2015
Type Package Package PedCNV February 19, 2015 Title An implementation for association analysis with CNV data. Version 0.1 Date 2013-08-03 Author, Sungho Won and Weicheng Zhu Maintainer
More informationPackage PTE. October 10, 2017
Type Package Title Personalized Treatment Evaluator Version 1.6 Date 2017-10-9 Package PTE October 10, 2017 Author Adam Kapelner, Alina Levine & Justin Bleich Maintainer Adam Kapelner
More informationPackage ukbtools. February 5, 2018
Version 0.10.1 Title Manipulate and Explore UK Biobank Data Package ukbtools February 5, 2018 Maintainer Ken Hanscombe A set of tools to create a UK Biobank
More informationMicrosoft IT Leverages its Compute Service to Virtualize SharePoint 2010
Microsoft IT Leverages its Compute Service to Virtualize SharePoint 2010 Published: June 2011 The following content may no longer reflect Microsoft s current position or infrastructure. This content should
More informationTree Models of Similarity and Association. Clustering and Classification Lecture 5
Tree Models of Similarity and Association Clustering and Lecture 5 Today s Class Tree models. Hierarchical clustering methods. Fun with ultrametrics. 2 Preliminaries Today s lecture is based on the monograph
More informationGenetic type 1 Error Calculator (GEC)
Genetic type 1 Error Calculator (GEC) (Version 0.2) User Manual Miao-Xin Li Department of Psychiatry and State Key Laboratory for Cognitive and Brain Sciences; the Centre for Reproduction, Development
More informationBEAGLECALL 1.0. Brian L. Browning Department of Medicine Division of Medical Genetics University of Washington. 15 November 2010
BEAGLECALL 1.0 Brian L. Browning Department of Medicine Division of Medical Genetics University of Washington 15 November 2010 BEAGLECALL 1.0 P a g e i Contents 1 Introduction... 1 1.1 Citing BEAGLECALL...
More informationA GENETIC ALGORITHM FOR CLUSTERING ON VERY LARGE DATA SETS
A GENETIC ALGORITHM FOR CLUSTERING ON VERY LARGE DATA SETS Jim Gasvoda and Qin Ding Department of Computer Science, Pennsylvania State University at Harrisburg, Middletown, PA 17057, USA {jmg289, qding}@psu.edu
More informationPackage QCEWAS. R topics documented: February 1, Type Package
Type Package Package QCEWAS February 1, 2019 Title Fast and Easy Quality Control of EWAS Results Files Version 1.2-2 Date 2019-02-01 Author Peter J. van der Most, Leanne K. Kupers, Ilja Nolte Maintainer
More informationBiology Project 1
Biology 6317 Project 1 Data and illustrations courtesy of Professor Tony Frankino, Department of Biology/Biochemistry 1. Background The data set www.math.uh.edu/~charles/wing_xy.dat has measurements related
More information