Genetic type 1 Error Calculator (GEC)

Similar documents
KGG: A systematic biological Knowledge-based mining system for Genomewide Genetic studies (Version 3.5) User Manual. Miao-Xin Li, Jiang Li

Step-by-Step Guide to Basic Genetic Analysis

Polymorphism and Variant Analysis Lab

BICF Nano Course: GWAS GWAS Workflow Development using PLINK. Julia Kozlitina April 28, 2017

CircosVCF workshop, TAU, 9/11/2017

Rice Imputation Server tutorial

Emile R. Chimusa Division of Human Genetics Department of Pathology University of Cape Town

Genetic Analysis. Page 1

Importing and Merging Data Tutorial

Download PLINK from

Data input vignette Reading genotype data in snpstats

Release Note. Agilent Genomic Workbench 6.5 Lite

PhD: a web database application for phenotype data management

SNP HiTLink Manual. Yoko Fukuda 1, Hiroki Adachi 2, Eiji Nakamura 2, and Shoji Tsuji 1

PRSice: Polygenic Risk Score software v1.22

REAP Software Documentation

Package SMAT. January 29, 2013

GWAS Exercises 3 - GWAS with a Quantiative Trait

GMDR User Manual Version 1.0

PriVar documentation

User Manual ixora: Exact haplotype inferencing and trait association

ELAI user manual. Yongtao Guan Baylor College of Medicine. Version June Copyright 2. 3 A simple example 2

Spotter Documentation Version 0.5, Released 4/12/2010

Statistical Analysis for Genetic Epidemiology (S.A.G.E.) Version 6.4 Graphical User Interface (GUI) Manual

SKAT Package. Seunggeun (Shawn) Lee. July 21, 2017

Package RVS0.0 Jiafen Gong, Zeynep Baskurt, Andriy Derkach, Angelina Pesevski and Lisa Strug October, 2016

MACAU User Manual. Xiang Zhou. March 15, 2017

PLATO User Guide. Current version: PLATO 2.1. Last modified: September Ritchie Lab, Geisinger Health System

GWAsimulator: A rapid whole-genome simulation program

Documentation for FCgene

PBAP Version 1 User Manual

Optimising PLINK. Weronika Filinger. September 2, 2013

Package SimGbyE. July 20, 2009

BOLT-LMM v1.2 User Manual

arxiv: v2 [q-bio.qm] 17 Nov 2013

PBAP Version 1 User Manual

EMIM: Estimation of Maternal, Imprinting and interaction effects using Multinomial modelling

BEAGLECALL 1.0. Brian L. Browning Department of Medicine Division of Medical Genetics University of Washington. 15 November 2010

MPG NGS workshop I: Quality assessment of SNP calls

Network Based Models For Analysis of SNPs Yalta Opt

Axiom Analysis Suite Release Notes (For research use only. Not for use in diagnostic procedures.)

Step-by-Step Guide to Relatedness and Association Mapping Contents

myvcf Documentation Release latest

LD vignette Measures of linkage disequilibrium

PRSice: Polygenic Risk Score software - Vignette

Package MultiMeta. February 19, 2015

HapBlock A Suite of Dynamic Programming Algorithms for Haplotype Block Partitioning and Tag SNP Selection Based on Haplotype and Genotype Data

The fgwas software. Version 1.0. Pennsylvannia State University

Recalling Genotypes with BEAGLECALL Tutorial

Release Note. Agilent Genomic Workbench Standard Edition

User Guide for Tn-seq analysis software (TSAS) by

GBS Bioinformatics Pipeline(s) Overview

BGGN-213: FOUNDATIONS OF BIOINFORMATICS (Lecture 14)

GMDR User Manual. GMDR software Beta 0.9. Updated March 2011

Package lodgwas. R topics documented: November 30, Type Package

Package RobustSNP. January 1, 2011

Release Notes. JMP Genomics. Version 4.0

Handling sam and vcf data, quality control

JMP Genomics. Release Notes. Version 6.0

arxiv: v1 [q-bio.gn] 16 Nov 2017

Estimating. Local Ancestry in admixed Populations (LAMP)

The Lander-Green Algorithm in Practice. Biostatistics 666

FORMAT PED PHENO Software Documentation

4.1. Access the internet and log on to the UCSC Genome Bioinformatics Web Page (Figure 1-

Tutorial on gene-c ancestry es-ma-on: How to use LASER. Chaolong Wang Sequence Analysis Workshop June University of Michigan

MQLS-XM Software Documentation

QTX. Tutorial for. by Kim M.Chmielewicz Kenneth F. Manly. Software for genetic mapping of Mendelian markers and quantitative trait loci.

MIRING: Minimum Information for Reporting Immunogenomic NGS Genotyping. Data Standards Hackathon for NGS HACKATHON 1.0 Bethesda, MD September

GSCAN GWAS Analysis Plan, v GSCAN GWAS ANALYSIS PLAN, Version 1.0 October 6, 2015

Package icnv. R topics documented: March 8, Title Integrated Copy Number Variation detection Version Author Zilu Zhou, Nancy Zhang

A short manual for LFMM (command-line version)

Iterative Learning of Single Individual Haplotypes from High-Throughput DNA Sequencing Data

Introduction to Evolutionary Computation

SE - Deployment to Installed Polarion. SE - Execution from Workspace. SE - Configuration.

Package LGRF. September 13, 2015

HPC methods for hidden Markov models (HMMs) in population genetics

FVGWAS- 3.0 Manual. 1. Schematic overview of FVGWAS

Breeding Guide. Customer Services PHENOME-NETWORKS 4Ben Gurion Street, 74032, Nes-Ziona, Israel

User s Guide. Version 2.2. Semex Alliance, Ontario and Centre for Genetic Improvement of Livestock University of Guelph, Ontario

Linkage Disequilibrium Map by Unidimensional Nonnegative Scaling

Sequence Genotyper Reference Guide

ToCatchAThief c ryan campbell & jenn coughlan 7/23/2018

Package inversion. R topics documented: July 18, Type Package. Title Inversions in genotype data. Version

Click on "+" button Select your VCF data files (see #Input Formats->1 above) Remove file from files list:

Small example of use of OmicABEL

Introduction to Hail. Cotton Seed, Technical Lead Tim Poterba, Software Engineer Hail Team, Neale Lab Broad Institute and MGH

Step-by-Step Guide to Advanced Genetic Analysis

QUICKTEST user guide

The LDheatmap Package

i2b2 Workbench Developer s Guide: Eclipse Neon & i2b2 Source Code

Package globalgsa. February 19, 2015

LFMM version Reference Manual (Graphical User Interface version)

Tutorial. Batching of Multi-Input Workflows. Sample to Insight. November 21, 2017

The fgwas Package. Version 1.0. Pennsylvannia State University

SNPViewer Documentation

PMOD Installation on MacOSX Systems

MAGMA manual (version 1.06)

Applications of a generic model of genomic variations functional analysis

SFDR (Stratified False Discovery Rate) Software Documentation. Version 1.6 Feb 7, 2010

Data formats in GWASTools

Transcription:

Genetic type 1 Error Calculator (GEC) (Version 0.2) User Manual Miao-Xin Li Department of Psychiatry and State Key Laboratory for Cognitive and Brain Sciences; the Centre for Reproduction, Development and Growth; and Genome Research Centre, the University of Hong Kong, Pokfulam, Hong Kong

Contents 1. Introduction... 3 2. Installation... 4 2.1 Installation of Java Runtime Environment (JRE)... 4 2.2 Installation of GEC... 4 3. Input files... 4 3.1 Input files used to obtain LD information of SNPs... 4 3.2. Input files for multiple testing... 5 4. Examples... 6 4.1 Estimate effective number of independent test... 6 4.2 Do multiple-testing in a set of p-values... 8

1. Introduction The Genetic Type I error calculator (GEC) is a Java-based application developed to address multiple-testing issue with dependent Single-nucleotide polymorphisms (SNPs). The core part is a new measure of effective number of independent tests Me [Hum Genet. 2012 May;131(5):747-56.], which is more roust that available methods [Figure 1]. Based on this new measure, several popular multiple-testing methods including Bonferroni, Holm, Simes correction was improved to evaluate significance level of SNP p-values in genome-wide association studies. A standalone version of this tool was provided to process large datasets on users local computers and an on-line version (GEC, http://statgenpro.psychiatry.hku.hk/gec/estimateb.php?function=bonferroni) was made for users to quick handle a SMALL dataset conveniently. Figure 1: Box plot of MVN-derived FWERs for 5 different methods. For each method, the nominal FWER was set to be 0.05. The bottom and top of each box mark the 25th and 75th percentile, respectively, and the band in the box denotes the 50th percentile. The lines above and below each box denote the upper and lower 1.5 interquartile range (IQR). This result indicated the our new measure (The proposed) is more robust (See detailed description about the analysis in our paper) When may you need GEC to help you? I want to know the independent number of tests or redundancy degree in a set of dependent SNPs. the p-value thresholds to declare significant SNPs for association in a set of dependent p-values.

2. Installation 2.1 Installation of Java Runtime Environment (JRE) The JRE is required to run GEC on any operating systems (OS). It can be downloaded from http://java.sun.com/javase/downloads/index.jsp for free. 2.2 Installation of GEC GEC has not had an installation wizard by far. After downloaded from our website and decompressed, it can be launched through a command, java -jar Xms256m Xmx1300m "./GEC.jar <arguments >, in a command prompt window provided by OS. In the command, -Xms<size> and -Xmx<size> set the initial and maximum Java heap sizes for GEC respectively. A larger maximum heap size can speed up the process of analysis. A higher setting like Xmx1300m is suggested dealing with large number of SNPs, say more than 2,000,000. The number, however, should be less than the size of physical memory. 3. Input files 3.1 Input files used to obtain LD information of SNPs GEC now is able to recognize dataset containing LD information of SNPs in 4 different formats. Use can choose any one of appropriate formats. 3.1.1 HapMap LD dataset format. Uses can download the LD data from the HapMap website (http://hapmap.ncbi.nlm.nih.gov/downloads/ld_data/latest/) as input of GEC without any modification. Notably, it is not necessary to uncompress the downloaded file form HapMap because GEC is able to recognize the.gz compressed format. 3.1.2 Plink Binary format GEC can directly read genotypes in the binary format generated by Plink(http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#bed), which compressed format and can be stored and processed more efficiently. GEC will calculate the genotypic correlation to approximate the LD degree between SNPs. The Plink binary file set always includes three linked files *.fam, *.bim and *.bed, which should be put in the same folder.

3.1.3 Linkage format Genotypes in the linkage format are also a valid input of GEC. The linkage file set include two linked files: Pedigree file and Map file. A detailed description about this format can be referred to http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped and http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#map. The following is a simple example. GEC is also able to recognize the.gz compressed format of these files. Linkage Pedigree Format (Example): 1 100 0 0 1 0 c g t g t g c a 2 101 0 0 2 2 c c t t t t c c 3 307 0 0 2 2 c g t g t g c a 4 502 0 0 1 1 0 0 0 0 t t a a 5 501 0 0 1 1 c c t t 0 0 c c 6 306 0 0 1 1 g g g g g g a a The first five columns indicate the Pedigree ID, Individual ID, Father ID, Mother ID, and Gender respectively. The sixth column is for phenotype and the remaining columns are for genotypes. Genotypes must be denoted by the standard single nucleotide symbols (a, t, g and c), which are case-insensitive. Missing genotypes are indicated by 0 0. Hint: In the current version, GEC assume all subjects are independent when calculating the genotypic correlation between SNPs. So please do not input sample with genetically related individuals. Linkage Map File Format (Example): 1 rs2980300 0 775852 1 rs10907175 0 1120590 1 rs2887286 0 1145994 1 rs307378 0 1258710 1 rs7540231 0 1495898 The map file describes the SNP information with genotypes in the pedigree file. Four attributes are required, chromosome, rsid, genetic map, and physical position on the human genome assembly. The columns of chromosome, rsid and physical position are required. 3.1.4 Haplotype data of 1000 Genome project in VCF files This is a simple and easy format of phased haplotype data provided by us http://statgenpro.psychiatry.hku.hk/limx/kgg/phasedgty.html. You can download the ancestry matched VCF data to account for LD in your own dataset. 3.2. Input files for multiple testing GEC can read p-values and report significant SNPs. The output of many genetic

association tools can be input to GEC without any modification. Users need specify the column order of chromosome, SNP ID, the physical position of SNP and p-values. 4. Examples 4.1 Estimate effective number of independent test 1) By genotype data in conventional linkage format I. on whole genome: --linkage-file D:/tmp/test \ --genome II. in some regions: --linkage-file D:/tmp/test \ --regions chr2:0-233434,chr4:2323-54564554 \ --coordinate-version hg19>hg18 \ Hint:--coordinate-version specifies the reference genome version of physical positions (coordinates) in the linkage map file and -regions respectively. III. in some chromosomes: --linkage-file D:/tmp/test \ --regions chr2,chr4 \ 2) By genotype data in plink binary format I. on whole genome: --plink-binary D:/tmp/test \ --genome II. in some regions: --plink-binary D:/tmp/test \ --regions chr2:0-233434,chr4:2323-54564554 \ --coordinate-version hg19>hg18 \ Hint:--coordinate-version specifies the reference genome version of physical positions (coordinates) in the plink map file and -regions respectively. III. in some chromosomes: --plink-binary D:/tmp/test \ --regions chr2,chr4 \ 3) By genotypes from 1000 Genomes Project VCF format I. on whole genome:

--vcf-file D:/tmp/CEU/ 1kg.phase1.v3.shapeit2.eur.hg19.chr_CHROM_.vcf.gz \ --genome \ Hint: 1. We have provided a set of compiled 1KG 1000 Genomes data, http://statgenpro.psychiatry.hku.hk/limx/kgg/phasedgty.html. 2. _CHROM_ is a variable standing for the chromosome name II. in some regions: --vcf-file D:/tmp/CEU/ 1kg.phase1.v3.shapeit2.eur.hg19.chr_CHROM_.vcf.gz \ --regions chr2:0-233434,chr4:2323-54564554 \ --coordinate-version hg19>hg18 \ physical positions (coordinates) in the vcf file and -regions respectively. III. in some chromosomes: --vcf-file D:/tmp/CEU/ 1kg.phase1.v3.shapeit2.eur.hg19.chr_CHROM_.vcf.gz \ --regions chr2,chr4 \ 4) By HapMap LD data I. on whole genome: --ld-file D:/tmp/CEU/ld_chr_CHROM CEU.txt.gz \ --genome \ II. in some regions: --ld-file D:/tmp/CEU/ld_chr_CHROM CEU.txt.gz \ --regions chr2:0-233434,chr4:2323-54564554 \ --coordinate-version hg19>hg18 \ physical positions (coordinates) in the ld file and -regions respectively. IV. in some chromosomes: --ld-file D:/tmp/CEU/ld_chr_CHROM CEU.txt.gz \ --regions chr2,chr4 \

5) By Haploype data with the format the same as MACH I. on whole genome: --haplotype-file D:/tmp/CEU/EUR.chr_CHROM_.hap --haplotype-map D:/tmp/CEU/EUR.chr_CHROM_.map \ --genome \ II. in some regions: --haplotype-file D:/tmp/CEU/EUR.chr_CHROM_.hap \ --haplotype-map D:/tmp/CEU/EUR.chr_CHROM_.map \ --regions chr2:0-233434,chr4:2323-54564554 \ --coordinate-version hg19>hg18 \ Note: _CHROM_ is a variable standing for the chromosome name physical positions (coordinates) in the haplotype map file and -regions respectively. III. in some chromosomes: --haplotype-file D:/tmp/CEU/EUR.chr_CHROM_.hap \ --haplotype-map D:/tmp/CEU/EUR.chr_CHROM_.map \ --regions chr2,chr4 \ 4.2 Do multiple-testing in a set of p-values 6) By genotype data in conventional linkage format java -Xmx1g -jar gec.jar --effect-number [or --multiple-test] --error 0.05 \ --linkage-file D:/tmp/test \ --pvalue-file c:/tmp/test-results.txt \ --chrom-column 1 \ --marker-column 2 \ --marker-position-column 3 \ --coordinate-version hg19-hg18 \ --pvalue-column 7 \ Hint: --multiple-test allows a modified multiple testing by Holm and Simes procedures. However, it will be very slow for large number of SNPs. --error is to specify the global type I error rate. physical positions (coordinates) in the linkage map file and p value file respectively. 7) By genotype data in plink binary format java -Xmx1g -jar gec.jar --effect-number [or --multiple-test] --error 0.05 \ --plink-binary D:/tmp/test \ --pvalue-file c:/tmp/test-results.txt \ --chrom-column 1 \ --marker-column 2 \

--marker-position-column 3 \ --coordinate-version hg19-hg18 \ --pvalue-column 7 \ Hint: --multiple-test allows a modified multiple testing by Holm and Simes procedures. However, it will be very slow for large number of SNPs. --error is to specify the global type I error rate. physical positions (coordinates) in the plink map file and p value file respectively. 8) By genotypes from 1000 Genomes Project VCF format java -Xmx1g -jar gec.jar --effect-number [or --multiple-test] --error 0.05 \ --vcf-file D:/tmp/CEU/ 1kg.phase1.v3.shapeit2.eur.hg19.chr_CHROM_.vcf.gz \ --pvalue-file c:/tmp/test-results.txt \ --chrom-column 1 \ --marker-column 2 \ --marker-position-column 3 \ --pvalue-column 7 \ --coordinate-version hg19-hg18 \ Hint: We have provided a set of compiled 1KG 1000 Genomes data, http://statgenpro.psychiatry.hku.hk/limx/kgg/phasedgty.html. _CHROM_ is a variable standing for the chromosome name --multiple-test allows a modified multiple testing by Holm and Simes procedures. However, it will be very slow for large number of SNPs. --error is to specify the global type I error rate. physical positions (coordinates) in the vcf file and p value file respectively. If the versions of coordinates are identical, you can put the same version ID, say, hg19-hg19. 9) By HapMap LD data java -Xmx1g -jar gec.jar --effect-number [or --multiple-test] --error 0.05 \ --ld-file D:/tmp/CEU/ld_chr_CHROM CEU.txt.gz \(Assume the reference genome version is hg18) --pvalue-file c:/tmp/test-results.txt \ --chrom-column 1 \ --marker-column 2 \ --marker-position-column 3 \ --pvalue-column 7 \ --coordinate-version hg19-hg18 \ --multiple-test allows a modified multiple testing by Holm and Simes procedures. However, it will be very slow for large number of SNPs. --error is to specify the global type I error rate. physical positions (coordinates) in the ld file and p value file respectively. 10) By Haploype data with the format the same as MACH java -Xmx1g -jar gec.jar --effect-number [or --multiple-test] --error 0.05 \ --haplotype-file D:/tmp/CEU/EUR.chr_CHROM_.hap \ --haplotype-map D:/tmp/CEU/EUR.chr_CHROM_.map \

--pvalue-file c:/tmp/test-results.txt \ --chrom-column 1 \ --marker-column 2 \ --marker-position-column 3 \ --pvalue-column 7 \ --coordinate-version hg19-hg18 \ --multiple-test allows a modified multiple testing by Holm and Simes procedures. However, it will be very slow for large number of SNPs. --error is to specify the global type I error rate. physical positions (coordinates) in the haplotype map file and p value file respectively.