FVGWAS- 3.0 Manual. 1. Schematic overview of FVGWAS

Similar documents
Step-by-Step Guide to Advanced Genetic Analysis

Ricopili: Introdution. WCPG Education Day Stephan Ripke / Raymond Walters Toronto, October 2015

KGG: A systematic biological Knowledge-based mining system for Genomewide Genetic studies (Version 3.5) User Manual. Miao-Xin Li, Jiang Li

A User s Guide to Graphical-Model-based Multivariate Analysis

Package MultiMeta. February 19, 2015

SEQGWAS: Integrative Analysis of SEQuencing and GWAS Data

The Analysis of RAD-tag Data for Association Studies

PRSice: Polygenic Risk Score software - Vignette

A Review of Statistical Methods in Imaging Genetics

Step-by-Step Guide to Basic Genetic Analysis

BICF Nano Course: GWAS GWAS Workflow Development using PLINK. Julia Kozlitina April 28, 2017

Single Subject Demo Data Instructions 1) click "New" and answer "No" to the "spatially preprocess" question.

A short manual for LFMM (command-line version)

GWAS Exercises 3 - GWAS with a Quantiative Trait

Package GWAF. March 12, 2015

Improved Ancestry Estimation for both Genotyping and Sequencing Data using Projection Procrustes Analysis and Genotype Imputation

Polymorphism and Variant Analysis Lab

CTL mapping in R. Danny Arends, Pjotr Prins, and Ritsert C. Jansen. University of Groningen Groningen Bioinformatics Centre & GCC Revision # 1

Gene Survey: FAQ. Gene Survey: FAQ Tod Casasent DRAFT

Pattern Recognition for Neuroimaging Toolbox: PRoNTo

MAGA: Meta-Analysis of Gene-level Associations

Emile R. Chimusa Division of Human Genetics Department of Pathology University of Cape Town

Package lodgwas. R topics documented: November 30, Type Package

Release Notes. JMP Genomics. Version 4.0

FSL Workshop Session 3 David Smith & John Clithero

User Manual. Ver. 3.0 March 19, 2012

Step-by-Step Guide to Relatedness and Association Mapping Contents

The Imprinting Model

Maximizing Public Data Sources for Sequencing and GWAS

Package GEM. R topics documented: January 31, Type Package

Spotter Documentation Version 0.5, Released 4/12/2010

LEA: An R Package for Landscape and Ecological Association Studies

The fgwas Package. Version 1.0. Pennsylvannia State University

Rice Imputation Server tutorial

MAGMA manual (version 1.06)

LFMM version Reference Manual (Graphical User Interface version)

Expression Analysis with the Advanced RNA-Seq Plugin

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil

Importing and Merging Data Tutorial

MACAU User Manual. Xiang Zhou. March 15, 2017

snpqc an R pipeline for quality control of Illumina SNP data

Documentation for imcalc (SPM 5/8/12) Robert J Ellis

PRSice: Polygenic Risk Score software v1.22

Package REGENT. R topics documented: August 19, 2015

User Manual ixora: Exact haplotype inferencing and trait association

GMDR User Manual. GMDR software Beta 0.9. Updated March 2011

Genetic Analysis. Page 1

USER S MANUAL FOR THE AMaCAID PROGRAM

Genome-Wide Association Study Using

SKAT Package. Seunggeun (Shawn) Lee. July 21, 2017

Genetic Programming. Charles Chilaka. Department of Computational Science Memorial University of Newfoundland

Release Note. Agilent Genomic Workbench 6.5 Lite

Package MOJOV. R topics documented: February 19, 2015

GraphVar: A user-friendly toolbox for comprehensive graph analyses of functional brain connectivity.

What can we do today that we couldn t do before.

ChIP-seq Analysis Practical

PROMO 2017a - Tutorial

Data Currently Available (And How to Access It) Chance Hohensee Data Training September 9, 2016

Package EMLRT. August 7, 2014

Estimating Variance Components in MMAP

HIPPIE User Manual. (v0.0.2-beta, 2015/4/26, Yih-Chii Hwang, yihhwang [at] mail.med.upenn.edu)

Bioinformatics - Homework 1 Q&A style

The fgwas software. Version 1.0. Pennsylvannia State University

500K Data Analysis Workflow using BRLMM

Large scale SNP data management and integration in breeding programs

Introduction to Systems Biology II: Lab

Supporting Figures Figure S1. FIGURE S2.

When we search a nucleic acid databases, there is no need for you to carry out your own six frame translation. Mascot always performs a 6 frame

MAGMA manual (version 1.05)

Package ridge. R topics documented: February 15, Title Ridge Regression with automatic selection of the penalty parameter. Version 2.

RnBeadsDJ A Quickstart Guide to the RnBeads Data Juggler

Package DSPRqtl. R topics documented: June 7, Maintainer Elizabeth King License GPL-2. Title Analysis of DSPR phenotypes

Supplementary methods

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

WEKA homepage.

Recalling Genotypes with BEAGLECALL Tutorial

GPU Data Mining in Neuroimaging Genomics

Intro to NGS Tutorial

Genetic type 1 Error Calculator (GEC)

HPC Course Session 3 Running Applications

Using CLC Genomics Workbench on Turing

Package LEA. April 23, 2016

GMDR User Manual Version 1.0

Neuroimaging Group Pipeline Quick Start Manual

How to Reveal Magnitude of Gene Signals: Hierarchical Hypergeometric Complementary Cumulative Distribution Function

Improved VariantSpark breaks the curse of dimensionality for machine learning on genomic data

Neuroimaging and mathematical modelling Lesson 2: Voxel Based Morphometry

The Lander-Green Algorithm in Practice. Biostatistics 666

QUICKTEST user guide

Package Eagle. January 31, 2019

BGGN-213: FOUNDATIONS OF BIOINFORMATICS (Lecture 14)

JMP Genomics. Release Notes. Version 6.0

Group (Level 2) fmri Data Analysis - Lab 4

Supplementary Information. Detecting and annotating genetic variations using the HugeSeq pipeline

Alpha 1 i2b2 User Guide

Differential Expression Analysis at PATRIC

A comprehensive modelling framework and a multiple-imputation approach to haplotypic analysis of unrelated individuals

Package snm. July 20, 2018

srna Detection Results

Group Level Imputation of Statistic Maps, Version 1.0. User Manual by Kenny Vaden, Ph.D. musc. edu)

Transcription:

FVGWAS- 3.0 Manual Hongtu Zhu @ UNC BIAS Chao Huang @ UNC BIAS Nov 8, 2015 More and more large- scale imaging genetic studies are being widely conducted to collect a rich set of imaging, genetic, and clinical data to detect putative genes for complexly inherited neuropsychiatric and neurodegenerative disorders. Several major big- data challenges arise from testing genome- wide associations with signals at millions of locations in the brain from thousands of subjects. The aim of this software is to efficiently carry out whole- genome analyses of whole- brain data. FVGWAS consists of three components including a heteroscedastic linear model, a global sure independence screening (GSIS) procedure, and a detection procedure based on wild bootstrap methods. In the latest version, we implemented the option to divide and conquer large genotype datasets, which further improves the efficiency. Our FVGWAS may be a valuable statistical toolbox for large- scale imaging genetic analysis as the field is rapidly advancing with ultra- high- resolution imaging and whole- genome sequencing. 1. Schematic overview of FVGWAS

2. FVGWAS pipeline description For FVGWAS, we provide two different version: interface and server- based pipeline. If you have a small data set, e.g., ROI data and SNP data from one chromosome, you can use the interface on your local desktop. If you have a large data set, e.g., whole brain imaging data and imputed SNP data, you can use the server- based pipeline which can also help you run all the tasks via parallel computing. 2.1. Interface- based FVGWAS Covariates n*d design matrix. n is the sample size and d is the number of covariates Data Image data (n*v matrix). V is the number of voxels (whole brain analysis) or the number of ROIs (volumetric analysis) SNP Genetic data (n*c matrix). C is the number of SNPs Image index V*1 vector. The index of voxels within the brain mask (Applicable only for whole brain analysis)

Image size The size of image data, e.g., 1*3 vector means 3- D image data (Applicable only for whole brain analysis) N0 The number of top SNPs G The number of bootstrap samples Output directory Set up the directory, which is used to store all the results RUN Start running all the tasks Clear Clear all the input information on the interface 2.2 Server- based FVGWAS Pipeline If you have a big data set and prefer to run FVGWAS on the workstations or servers. There are five MATLAB functions, which can help you accomplish all the tasks in FVGWAS. A. FVGWAS_GSIS_server.m This function is to run global sure independent screening (GSIS) procedure. After running this function, you will get the raw p values for all the loci with MAF no less than 0.05. B. FVGWAS_Manhattanplot_server.m This function is to summarize all the p values and SNP information into one file, which will be used as an input file in plotting Manhattan plot and QQ plot. After running this function, you will get a file including the p values, SNP names, chromosome IDs, and BP values for all loci. C. FVGWAS_Bootstrap_server.m

This function is to run the bootstrapping in the detection procedure. After running this function, you will get the bootstrap test statistics, which will be used to approximate null distribution of the proposed test statistics. In this step, because the bootstrapping requires a lot of memory, we adopt to run part of the code via parallel computing. If you have multiple cores on the server, you can finish this step very efficiently. When running large datasets, it s possible to run out of memory. Please use divide and conquer algorithm if that happened. D. FVGWAS_Detection_server.m This function is used obtain adjusted p values for all loci with MAF no less than 0.05 in the detection procedure. After running this function, you will get the significant locus- voxel pairs and significant clusters for top SNPs found in GSIS step. 2.3. Server- based FVGWAS Pipeline with divide and conquer strategy We implemented a divide and conquer strategy to split large dataset and perform the analysis more efficiently in terms of both time cost and memory cost. There are seven MATLAB functions and one bash script, which can help you accomplish the tasks in FVGWAS. A. FVGWAS_Parameter_DVD.m This function is to set all input information, including the name of image data file, covariate file, genotype file, image index file, image size file, output directory and all required parameters for FVGWAS. B. FVGWAS_GSIS_server_DVD.m This function is to run global sure independent screening (GSIS) procedure. Splitting option is available by specifying snpc, the size of SNP subset. After running this function, you will get the p values for all the loci with MAF no less than 0.05. C. FVGWAS_Manhattanplot_server_DVD.m This function is to summarize all the p values and SNP information into one file, which will be used as an input file in plotting Manhattan plot and QQ plot. After running this function, you will get a file including the p values, SNP names, chromosome IDs, and BP values for all loci. D. split.scp This script is to split genotype dataset into several unjoint subsets of small size. The maximum number of snps in each subset is specified by parameter Bsize.

E. FVGWAS_Bootstrap_Top_server_DVD.m This function is to run the bootstrapping on each split dataset in the detection procedure. After running this function, you will get bootstrap test statistics for SNP subsets. In this step, you are able to run for all subsets simultaneously with much less time & memory cost on each single core. If you have multiple cores on the server, you can finish this step very efficiently. F. FVGWAS_Bootstrap_Cmb_server_DVD.m This function is to combine bootstrapping results from in the detection procedure. After running this function, you will get the bootstrap test statistics for the whole dataset, which will be used to approximate null distribution of the proposed test statistics. G. FVGWAS_Detection_server_DVD.m This function is to run the bootstrapping in the detection procedure. After running this function, you will get the significant locus- voxel pairs and significant clusters for top SNPs found in GSIS step. 2.4. GWAS Results Display This function (FVGWAS_ManPlot.R) is to plot Manhattan plot and QQ plot with R package (qqman). The file generated in function B is treated as the input file in this step. 3. Examples We consider both ROI volumes and RAVENS maps to illustrate the wide applicability of FVGWAS. We carried out two different FVGWAS analyses: one is to use the volumes of 93 ROIs as multivariate phenotypic vectors and the other is to use RAVENS maps as whole- brain phenotypic vectors. In both analyses, we included an intercept, gender, age, whole brain volume, and the top 5 principal component scores in SNPs as the covariates. Then, we tested the additive effect of each of 501,584 SNPs on either 93 ROI volumes or RAVENS maps. For the volumetric analysis, we run all the tasks via the interface pipeline, while for the Ravens map data, we conduct the analysis via server- based pipeline. 3.1. GUI example: ROI volume analysis 1. Download this package and unzip the file;

2. Start Matlab and set the following directory as the working folder C:\Users\Administrator\Desktop\FVGWAS- 3.0 Where C:\Users\Administrator\Desktop is the directory where you put the unzipped folder FVGWAS- 3.0 3. In the folder FVGWAS- 3.0, you can find six different folders and one R script. In the folder DATA, you can find the volumetric data set in the sub- folder volumetric. There are four files in that folder. Covariate_Data.mat includes all the demographic information; Image_Data.mat includes all the ROI thickness information; SNP_Data.txt includes the genetic data for all subjects; and SNP_info.map contains the SNP information including SNP names, chromosome IDs, and BP values for all loci. All the codes are stored in the folder Functions. The interface can be found in the folder GUI. 4. Set GUI as the current folder and type FVGWASgui in command window to launch FVGWAS in MATLAB.

5. Load the data into FVFWAS Click the button Covariates: load Covariate_Data.mat Click the button Data: load Image_Data.mat Click the button SNP: load SNP_Data.txt Set N 0 : 1000 Set G: 1000 Click the button Output directory: e.g../fvgwas- 3.0/Output/volumetric Click the button RUN 6. Output files

GSISresults.mat results of GSIS step, variable pp contains the - log_10{p- values} of all SNP data, and SNP_index contains all the SNP ID which are removed in preprocessing. VoxelclusterandSNP.mat variable rawpvalue is a C*N0 matrix including the raw p- values of top N0 SNPs. variable pv is a C*N0 matrix, including the corrected p- values. 7. Results display (1) Run FVGWAS_Manhattanplot.m in MatLab to generate the GWAS results. Output file: SNP_GSIS.txt. One thing should be noted that, in FVGWAS_Manhattanplot.m, there are two parameters should be modified according to your own settings. mode='volumetric'; volumetric is the name of the folder where the file SNP_GSIS.txt is saved. Flag=0; where 0 for ROI- based imaging data; 1 for voxel- based imaging data (2) Run FVGWAS_ManPlot.R in R to plot the Manhattan plot and QQ plot. One thing should be noted that, in FVGWAS_ManPlot.R, there are two parameters should be modified according to your own settings. statname="volumetric"; volumetric is the name of the folder where the file SNP_GSIS.txt is saved. HomeDir="C:/Users/Administrator/Desktop/FVGWAS- 2.0"; which is the directory the folder FVGWAS- 2.0 is located.

3.2. Server based example: Voxel- wise Analysis In the folder DATA, you can find the RavenMap data set in the sub- folder RavenMap. There are six files in that folder. Covariate_Data.mat includes all the demographic information; Image_Data.mat includes all the ROI thickness information; ImageIndex_Data.mat is a V*1 vector, which is the index of voxels within the brain mask; ImageSize_Data.mat is the size of image data, e.g., 1*3 vector means 3- D image data; SNP_Data.txt includes the genetic data for all subjects; and SNP_info.map contains the SNP information including SNP names, chromosome IDs, and BP values for all loci.all the codes are stored in the folder Functions. The server- based main functions can be found in the folder Server. 1. Set Server as the current folder and run FVGWAS_GSIS_server.m; 2. After Step 1 is finished, run FVGWAS_Bootstrap_server.m 3. After Step 2 is finished, run FVGWAS_Detection_server.m; 4. Result Display (1) Run FVGWAS_Manhattanplot_server.m in MatLab to generate the GWAS results. Output file: SNP_GSIS.txt. One thing should be noted that, in FVGWAS_Manhattanplot_server.m, there are two parameters should be modified according to your own settings. mode='ravenmap'; RavenMap is the name of the folder where the file SNP_GSIS.txt is saved. Flag=1; where 0 for ROI- based imaging data; 1 for voxel- based imaging data (2) Run FVGWAS_ManPlot.R in R to plot the Manhattan plot and QQ plot. One thing should be noted that, in FVGWAS_ManPlot.R, there are two parameters should be modified according to your own settings. statname="ravenmap"; RavenMap is the name of the folder where the file SNP_GSIS.txt is saved. HomeDir="C:/Users/Administrator/Desktop/FVGWAS- 3.0"; which is the directory the folder FVGWAS- 3.0 is located.

3.3. Example using divide and conquer strategy: Voxel- wise Analysis In the folder DATA, you can find the RavenMap data set in the sub- folder RavenMap. There are six files in that folder. Covariate_Data.mat includes all the demographic information; Image_Data.mat includes all the ROI thickness information; ImageIndex_Data.mat is a V*1 vector, which is the index of voxels within the brain mask; ImageSize_Data.mat is the size of image data, e.g., 1*3 vector means 3- D image data; SNP_Data.txt includes the genetic data for all subjects; and SNP_info.map contains the SNP information including SNP names, chromosome IDs, and BP values for all loci. All the codes are stored in the folder Functions. The server- based main functions can be found in the folder DVDConquer. 1. Set DVDConquer as the current folder and run FVGWAS_Parameter_DVD.m 2. Run FVGWAS_GSIS_server_DVD.m 3. After Step 2 is finished, run script split.scp. Parameter Bsize, the maximum size of each genotype subset, can be changed according to user need. 4. After Step 3 is finished, run FVGWAS_Bootstrap_Top_server_DVD.m with different subset index. For example, if there are 10 genotype subsets, the user can simultaneously run FVGWAS_Bootstrap_Top_server_DVD(1), FVGWAS_Bootstrap_Top_server_DVD(2),, FVGWAS_Bootstrap_Top_server_DVD (10) 5. After all jobs in Step 4 are finished, run FVGWAS_Bootstrap_Cmb_server_DVD.m 6. After Step 5 is finished, run FVGWAS_Detection_server_DVD.m 7. Result Display

(1) Run FVGWAS_Manhattanplot_server_DVD.m in MatLab to generate the GWAS results. Output file: SNP_GSIS.txt. One thing should be noted that, in FVGWAS_Manhattanplot_server_DVD.m, there are two parameters should be modified according to your own settings. mode='ravenmap'; RavenMap is the name of the folder where the file SNP_GSIS.txt is saved. Flag=1; where 0 for ROI- based imaging data; 1 for voxel- based imaging data (2) Run FVGWAS_ManPlot.R in R to plot the Manhattan plot and QQ plot. One thing should be noted that, in FVGWAS_ManPlot.R, there are two parameters should be modified according to your own settings. statname="ravenmap"; RavenMap is the name of the folder where the file SNP_GSIS.txt is saved. HomeDir="C:/Users/Administrator/Desktop/FVGWAS- 3.0"; which is the directory the folder FVGWAS- 3.0 is located. 4. PS 1. If you want to run volumetric analysis via server- based pipeline. You need to change two parameters in the following four main functions in folder Server FVGWAS_GSIS_server.m FVGWAS_Bootstrap_server.m FVGWAS_Detection_server.m mode='ravenmap'; should be changed to mode='volumetric'; Flag=1; should be changed to Flag=0; 2. Divide and conquer based pipeline (Example 3.3) is summarized by bash script subjob.scp, which can be used directly in linux. You should run only one step each time, following the order in the script and change parameter settings according to the instruction. Also, you might also need to change command to submit matlab jobs based on the server setting. 3. If you want to run volumetric analysis via divide- and- conquer pipeline. You need to change two parameters in split.scp and FVGWAS_Parameter_DVD.m in folder DVDConquer.

mode='ravenmap'; should be changed to mode='volumetric'; Flag=1; should be changed to Flag=0; If you are using bash script subjob.scp,you should also change mode= RavenMap to mode= volumetric ; 4. Please remain the name of each input file as the one shown in the folder DATA. Otherwise, you also need to modify them in the functions/scripts shown above. 5. If you have a new data set. You can put your data folder into the folder DATA, modify the value of mode in the four functions above, e.g., if the name of the new data folder is Newdata, you need to change the value of mode to mode='newdata'. Also, if the new data set is voxel- type imaging data, you need to set Flag=1, otherwise, Flag=0.