SEQGWAS: Integrative Analysis of SEQuencing and GWAS Data

SYNOPSIS

    SEQGWAS [--sfile] [--chr]

OPTIONS

    Option     Default              Description
    --sfile    specification.txt    Select a specification file
    --chr                           Select a chromosome

DESCRIPTION

SEQGWAS is a command-line program written in C/C++ for integrative analysis of sequencing and GWAS data. SEQGWAS produces all commonly used gene-level tests, including the burden test, the variable threshold (VT) test, and the sequence-kernel association test (SKAT), all of which are based on the score statistic for assessing the effects of individual variants on the trait of interest. SEQGWAS calculates the score statistic from the observed genotypes of sequenced subjects and the imputed genotypes of non-sequenced subjects. It then constructs a robust variance estimator that reflects the true variability of the score statistic regardless of the sampling scheme and imputation quality, so that the corresponding association tests always have correct type I error.

We are working intensely to improve the capabilities of SEQGWAS, so please check back frequently for updates.

INPUT FILES

Specification File

    REGRESSION_MODEL = linear    # linear/logistic
    SUBJECT_FILE = .//subject.dat
    SUBJECT_FILE_HEADER = TRUE
    SUBJECT_PHENOTYPE_COLUMN = 4
    SUBJECT_COVARIATE_COLUMN = 2 3
    SUBJECT_SEQUENCED_INDICATOR_COLUMN = 5    # optional
    VARIANT_FILE = .//variant_chr .dat
    VARIANT_FILE_HEADER = TRUE
    VARIANT_ID_COLUMN = 2
    VARIANT_POS_COLUMN = 1
    VARIANT_FREQ_COLUMN = 5
    VARIANT_RSQ_COLUMN = 8    # optional
    DOSAGE_FILE = .//dosage_chr .dat
    DOSAGE_FILE_HEADER = FALSE
    DOSAGE_FILE_SKIP_COLUMNS = 2
    ANNOTATION_FILE = .//annotation_chr .dat
    ANNOTATION_FILE_HEADER = FALSE
    ANNOTATION_TYPE = SNP    # SNP/gene
    ANNOTATION_POS_COLUMN = 2
    ANNOTATION_ACCESSION_COLUMN = 3
    ANNOTATION_FUNCTION_COLUMN = 4
    ANNOTATION_GENE_COLUMN = 5
    ANNOTATION_ID_COLUMN = 6
    OUTPUT_FILE = results_chr .out
    MAF_CUTOFF = 0.05

The file describes the input/output files and the program parameters. The syntax is KEYWORD = value1 [value2 ...], with spaces around =. All of the following lines are required unless stated as optional.

    REGRESSION_MODEL = linear/logistic
        Specify the regression model for genotype-phenotype association.

    SUBJECT_FILE = full_pathname
    SUBJECT_FILE_HEADER = TRUE/FALSE
    SUBJECT_PHENOTYPE_COLUMN = num
        Specify the column (starting with number 1) to be used as the phenotype.

    SUBJECT_COVARIATE_COLUMN = num_1 [num_2 ...]
        Specify column(s) in the subject file to be used as covariates in the regression model. Optional.

    SUBJECT_SEQUENCED_INDICATOR_COLUMN = num
        Specify the column to be used as the indicator of whether the subject is sequenced.

    DOSAGE_FILE = prefix affix
        Specify the prefix and affix of the pathname. The program inserts the chromosome number (single digit for 1-9 and two digits for 10-23), specified by --chr, between them to obtain the full pathname. For example, for the two strings in the example specification file, the dosage file for chromosome 1 is accessed through the pathname .//dosage_chr1.dat

    DOSAGE_FILE_HEADER = TRUE/FALSE
    DOSAGE_FILE_SKIP_COLUMNS = num
        Skip the first num columns.

    VARIANT_FILE = prefix affix
    VARIANT_FILE_HEADER = TRUE/FALSE
    VARIANT_ID_COLUMN = num
    VARIANT_POS_COLUMN = num
    VARIANT_FREQ_COLUMN = num
    VARIANT_RSQ_COLUMN = num
        Optional. If not specified, the Rsq measure will be calculated internally.

    ANNOTATION_FILE = prefix affix
    ANNOTATION_FILE_HEADER = TRUE/FALSE
    ANNOTATION_TYPE = SNP
        Specify the format of the annotation file. Currently, only the value SNP is allowed.

    ANNOTATION_POS_COLUMN = num
    ANNOTATION_ACCESSION_COLUMN = num
    ANNOTATION_FUNCTION_COLUMN = num
    ANNOTATION_GENE_COLUMN = num
    ANNOTATION_ID_COLUMN = num

    OUTPUT_FILE = prefix affix
    MAF_CUTOFF = MAF_cutoff
        Only variants with MAFs ≤ MAF_CUTOFF are considered for analysis.

All data files are space- or tab-delimited and may contain either one header row or none.

Subject File

    GWAS_ID    AfrIA          age    BMI         sequenced
    700001     0.779796662    74     33.17012    0
    700002     0.774728994    76     32.4515     0
    700003     0.765335395    59     22.94974    0

The file provides information on the phenotype, covariates, and sequencing indicator (indicating whether a subject is sequenced or not) for all subjects in the GWAS cohort. Each row is specific to an individual. The columns for the phenotype and the sequencing indicator are required; those for the subject identifier and covariates are optional. In a case-control study, the disease variable should be coded 0/1 to represent unaffected/affected. Missing data are denoted as . or NA.

Variant File

    pos         SNP            Al1    Al2    Freq1      MAF        AvgCall    Rsq
    10862587    snp.1218005    C      C      1          0          1          0
    10862595    snp.1218006    A      A      1          0          1          0
    10862598    snp.1218007    C      T      0.99314    0.00686    0.99314    4e-05

The file provides information on the sequencing-identified variants as well as the GWAS SNPs on the chromosome specified by --chr. Each row is specific to a SNP; the rows must be in genomic order. The columns for the position, SNP identifier, and coding-allele frequency are required; the column for the Rsq measure is optional. If the position of a SNP is missing, it should be denoted as . or NA, and that SNP will be excluded from analysis. The SNP position is used to link the SNPs in the variant and annotation files, so the positions in the two files should be comparable.

Dosage File

    700001    1.996    1.967    1.965
    700002    1.986    1.976    1.976
    700003    1.974    1.867    1.853

The file provides (imputed) genotypic dosages for all subjects in the GWAS cohort. Each row pertains to a subject; the order of subjects must match their order in the subject file. Each column pertains to a SNP; the order of SNPs must match their order in the variant file. The file may contain an arbitrary number of columns in front of the main data body.
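The prefix/affix pathname convention and the row and column alignment rules described above can be checked before a run. The following is a minimal Python sketch and is not part of SEQGWAS: the function names (parse_spec, chrom_path, parse_table, check_dosage_alignment) are hypothetical, and the dimension check is only an assumption about what a useful pre-flight test looks like, not a description of what the program does internally.

```python
"""Pre-flight checks for SEQGWAS-style input files (illustrative sketch only)."""


def parse_spec(text):
    """Parse 'KEYWORD = value1 [value2 ...]' lines; '#' starts a comment."""
    spec = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition("=")
        if sep:  # skip blank lines and lines without '='
            spec[key.strip()] = value.split()
    return spec


def chrom_path(prefix, affix, chrom):
    """Insert the chromosome number between the prefix and the affix."""
    return f"{prefix}{int(chrom)}{affix}"


def parse_table(text, header):
    """Split a space- or tab-delimited table into rows, dropping one header row."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    return rows[1:] if header else rows


def check_dosage_alignment(subject_rows, variant_rows, dosage_rows, skip_columns):
    """Return a list of dimension problems (an empty list means none found)."""
    problems = []
    if len(dosage_rows) != len(subject_rows):
        problems.append(
            f"dosage file has {len(dosage_rows)} rows, "
            f"subject file has {len(subject_rows)} subjects"
        )
    expected = skip_columns + len(variant_rows)  # leading columns + one per SNP
    for i, row in enumerate(dosage_rows, start=1):
        if len(row) != expected:
            problems.append(f"dosage row {i} has {len(row)} fields, expected {expected}")
    return problems


if __name__ == "__main__":
    # Assumption: the dosage excerpt in this manual has one leading ID column,
    # so this demo skips 1 column rather than the 2 used in the example spec.
    spec = parse_spec(
        "DOSAGE_FILE = .//dosage_chr .dat\n"
        "DOSAGE_FILE_SKIP_COLUMNS = 1  # leading subject ID\n"
    )
    print(chrom_path(*spec["DOSAGE_FILE"], 1))

    subjects = parse_table(
        "GWAS_ID AfrIA age BMI sequenced\n"
        "700001 0.779796662 74 33.17012 0\n"
        "700002 0.774728994 76 32.4515 0\n"
        "700003 0.765335395 59 22.94974 0\n",
        header=True,
    )
    variants = parse_table(
        "pos SNP Al1 Al2 Freq1 MAF AvgCall Rsq\n"
        "10862587 snp.1218005 C C 1 0 1 0\n"
        "10862595 snp.1218006 A A 1 0 1 0\n"
        "10862598 snp.1218007 C T 0.99314 0.00686 0.99314 4e-05\n",
        header=True,
    )
    dosage = parse_table(
        "700001 1.996 1.967 1.965\n"
        "700002 1.986 1.976 1.976\n"
        "700003 1.974 1.867 1.853\n",
        header=False,
    )
    skip = int(spec["DOSAGE_FILE_SKIP_COLUMNS"][0])
    print(check_dosage_alignment(subjects, variants, dosage, skip))
```

The check compares dimensions only. It cannot tell whether subjects or SNPs appear in the same order across files, so the ordering requirements still have to be ensured when the files are prepared.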

Annotation File

    21    44473956    NM_000071    utr-3       CBS    snp.1227710
    21    44473963    NM_000071    utr-3       CBS    snp.1227711
    21    44473980    NM_000071    utr-3       CBS    snp.1227714
    21    44474003    NM_000071    missense    CBS    snp.1227716

The file provides annotation information for the SNPs. The current version of SEQGWAS (v1.0) only allows the annotation format for SNPs. Specifically, each row pertains to a SNP; the rows must be grouped by the accession number.

OUTPUT

Output File

    chr    index    gene    accession    n_var    Rsq_gene    p_t1       p_t5       p_v        p_skat
    21     309      LIPI    NM_198996    9        0.7026      8.74e-1    3.83e-1    7.13e-1    5.03e-1
    21     311      TPTE    NM_199259    31       0.0012      2.99e-1    4.00e-1    2.67e-1    2.24e-1

The file contains, for each gene, the number of variants included (n_var), the gene-averaged Rsq (Rsq_gene), and the p-values of the burden test with MAF thresholds of 1% (T1) and 5% (T5), the variable threshold (VT) test, and SKAT.

EXAMPLE

Download and unzip the software package. Enter the command

    $ SEQGWAS -sfile specification.txt -chr 21

to obtain the results given in results_chr21.out.

REFERENCE

Hu, Y.J., Li, Y., Auer, P.L. and Lin, D.Y. Integrative Analysis of Sequencing and GWAS Data for Rare Variant Associations. Submitted.

VERSION HISTORY

    v1.0    2014/03/04    First version released.