Ricopili: Introduction. WCPG Education Day. Stephan Ripke / Raymond Walters. Toronto, October 2015


What will we offer?
Practical: Sorry, no practical sessions today; please refer to the summer school organized by the PGC.
Theoretical:
- Ricopili pipeline, usage
- PGC data organization on LISA
- Non-pipeline GWAS analyses (polyscore, ldscore, pathways)
- Experiences from the past, working with hundreds of GWAS datasets from various sources
- General lectures on GWAS and beyond

Outline for this session
Ricopili pipeline
- History, application
- Pros / cons
- General workflow, structure
Short presentation of the 4 core modules
- Preimputation
- Principal Component Analysis
- Imputation
- Postimputation

History
Developed and built up over the course of the last six years, just to make my life easier and to be able to handle large amounts of data from the Psychiatric Genomics Consortium. I am not a programmer. This was not a professional software project, and it is still evolving.

http://www.nimh.nih.gov/about/strategic-planning-reports/highlights/highlight-skyline-drivers.shtml (released: March 2015)

PGC SCZ (June 2014)
35476 cases, 46839 controls
97 genome-wide significant sites

QQ plot

Lambda plot over Freq ex.

Calcium channels are amongst the associated hits, e.g.:
- CACNA1C (chr. 12)
- CACNB2 (chr. 10)
- CACNA1I (chr. 22)

Increase in polygenic risk score prediction

Discoveries over sample size
(Plot: N hits, p < 5.0 * 10^-8, against N cases)
- Schizophrenia: ~4 / 1,000
- Crohn's: ~10 / 1,000
- Adult height: ~3 / 1,000
- (Bipolar disorder: ~3-4 / 1,000)

Three major elements turned the tide
1) Genome resources
2) Technology
3) Collaboration
The pipeline is not a major contributor, but it enables interaction between these three components.

Pros
- Efficient use of a computer cluster; optimized for infrastructure with hundreds or thousands of nodes. Memory and walltime requirements of single jobs are minimal, giving high speed even for very big projects
- Modular system, relatively easy to alter / expand (for me!)
- Standardized naming keeps the overview in big projects
- Highly automated: you get final results within days with very little interaction (if the data is good!!)
- High-quality (publication-ready) output files in various flavors, actionable (R scripts provided)

Cons
- Scripts are not cleanly written: not easy for outsiders to understand / expand / find bugs
- Highly interwoven with the cluster structure: not easy to port to a different system, almost impossible to run on a single machine
- Breaks easily if not used as intended
- High frustration level for new users

Advice
- Use the Google group, ask early: https://groups.google.com/a/broadinstitute.org/forum/#!forum/rp-users
- If you can help, please help
- Report back even if you found the problem and fixed it yourself; maybe we can integrate the fix into the pipeline for the next user
- GitHub is up and running: https://github.com/nealelab/ricopili/wiki

General structure of the pipeline

Ricopili 4 modules Interaction necessary Mostly standardized

Each module has a very different computational / user demand

Preimputation, three steps
1) Guess genotyping platform
2) Standardized file renaming, adding information to ID names (they become unique across datasets)
3) Technical QC with classic parameters; extended report with various plots (Manhattan, QQ, Lambda, Histograms)
Expected runtime with recent examples: 6000 individuals: 40 mins (highly dependent on how many datasets the data is spread over)
Needs interaction, so runtime is not as important
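To illustrate the idea behind step 2, here is a minimal sketch of making IDs unique by prefixing them with a dataset label. The function name, the example label and the `label*ID` layout are assumptions for illustration only; the pipeline's actual naming scheme is not shown on the slide.

```shell
#!/bin/sh
# Hypothetical illustration: prefix family and individual IDs in a PLINK .fam
# file with a dataset label so IDs stay unique once datasets are combined.
# The "label*ID" layout is an assumption, not the pipeline's documented scheme.
prefix_fam_ids() {
    # $1 = dataset label; reads .fam lines on stdin, writes renamed lines to stdout
    awk -v label="$1" '{ $1 = label "*" $1; $2 = label "*" $2; print }'
}

# Two made-up .fam lines: FID IID father mother sex phenotype
printf 'fam1 ind1 0 0 1 2\nfam2 ind2 0 0 2 1\n' | prefix_fam_ids scz_study1
```

Two datasets that both contain an individual called `ind1` would no longer collide after this kind of renaming, as long as each dataset gets its own label.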

Principal Components Analysis, 5 steps
1) Pruning, filtering
2) Overlap / relatedness testing
3) PCA
4) Association of PCAs with genotypes
5) Plots in various flavors
Runtime: 600 individuals: 5 minutes; 40,000 individuals: 1 hour
Used for two reasons:
1) Genomic QC, excluding ancestral outliers
2) Creating covariates for final analysis
Deduping over all datasets in final analysis

Imputation, 11 steps
1) Guess genome build
2) Align positions, SNP names
3) Align alleles
4) Cut into genomic chunks
5) Prephasing
6) Imputation
7) Data reformatting
8) Postimputation QC / best guess
9) Genome-wide best guess
10) Clean
11) Evaluate hard disk usage (~40 MB / ID, 40 GB / 1000 IDs)
Runtime: 1000 individuals: 4 hours; 15,000 individuals: 48 hours
Steps 5, 6 and 7 take > 90% of the computer resources of this module
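The disk usage rule of thumb from step 11 can be turned into a quick estimate before starting a run. This is only a back-of-the-envelope sketch: the 40 MB per individual figure comes from the slide, while the function name and the rounding are assumptions.

```shell
#!/bin/sh
# Rough disk estimate from the slide's rule of thumb: ~40 MB per individual.
# The function name is hypothetical; 1 GB is taken as 1000 MB.
MB_PER_ID=40

estimate_disk_gb() {
    # $1 = number of individuals; prints whole gigabytes, rounded up
    n_ids=$1
    echo $(( (n_ids * MB_PER_ID + 999) / 1000 ))
}

estimate_disk_gb 1000     # the slide's example, prints 40
estimate_disk_gb 15000    # prints 600
```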

Postimputation, 12 steps
1) Association analysis
2) Meta-analysis
3) Collection of results
4) Create a separate set for heterogeneity P
5) Count SNPs for various thresholds, create top lists
6) Clumping
7) Region plots
8) Forest plots
9) Manhattan plots
10) QQ plots
11) LD score
12) Lambda plots
Runtime: 1000 individuals: 30 mins; 40,000 individuals: 4 hours
Step 1 takes the majority of the computer resources of this module; it is possible to start from step 2
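Step 5 (counting SNPs at various thresholds) can be sketched with a few lines of awk. The two-column `SNP P` input layout, the function name and the example values are all made up for illustration; real pipeline output has more columns and may use different names.

```shell
#!/bin/sh
# Hypothetical input: whitespace-separated lines "SNP P" with one header row.
count_below() {
    # $1 = p-value threshold; reads the table on stdin,
    # prints the number of SNPs with P strictly below the threshold
    awk -v thr="$1" 'NR > 1 && ($2 + 0) < (thr + 0) { n++ } END { print n + 0 }'
}

# Made-up example results
results='SNP P
rs1 3e-9
rs2 4e-7
rs3 0.02
rs4 1e-8'

# One line per threshold: threshold, then the count of SNPs below it
for thr in 5e-8 1e-6 1e-4; do
    printf '%s\t%s\n' "$thr" "$(printf '%s\n' "$results" | count_below "$thr")"
done
```

With these example values, two SNPs pass the genome-wide significance threshold of 5e-8 and three pass each of the looser thresholds.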

Working directories
QC:
- As many pre-QC datasets as you have
- Different directory for different phenotype / ancestry
- One pipeline run per directory
Imputation:
- Subdirectory of qc
- One pipeline run per directory
- As many post-QC datasets as you have
- Different directory for different ancestry
- Easy merging of imputation directories afterwards

Working directories (continued)
Postimputation:
- Within the imputation subdirectory (it's possible to merge imputation directories)
- Many parallel pipeline runs possible (different --addout)
- Working directory and result directory get distinct names
PCA:
- As many pre-QC / post-QC / imputation datasets as you wish
- Many parallel pipeline runs possible per directory
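The nesting described above (one QC directory per phenotype / ancestry, with imputation as a subdirectory of it) might be laid out roughly as follows. Every directory name in this sketch is a hypothetical placeholder, and the pipeline itself may create different names.

```shell
#!/bin/sh
# Sketch of the directory nesting described above. All names are hypothetical
# placeholders, not names the pipeline is known to use.
base=$(mktemp -d)    # throwaway stand-in for a project root

for ancestry in eur eas; do
    qc_dir="$base/scz_${ancestry}_qc"    # one QC directory per phenotype / ancestry
    mkdir -p "$qc_dir/imputation"        # imputation is a subdirectory of qc
done

# Show the resulting tree relative to the project root
(cd "$base" && find . -type d | sort)
```

Postimputation runs would then be started from inside an imputation directory, with parallel runs kept apart by different --addout values as noted on the slide.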

Directory structure, visual representation

Several non-pipeline analyses we probably won't have time to cover here
- Replication
- Polygenic scoring
- Family (trio) imputation
- Chromosome X imputation
- Inclusion of results at the Ricopili website: http://www.broadinstitute.org/mpg/ricopili/

Installation of the pipeline, installing some useful aliases
https://sites.google.com/a/broadinstitute.org/ricopili
Short introduction:
alias c='sed "s#.*#scp $(whoami)@lisa.surfsara.nl:$(pwd)/& .#"'
alias ls='ls --color=auto'
alias cl='column -t'
Get yourself familiar with emacs or vim
Pubkey (id_rsa.pub), not covered here
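The `c` alias turns file names on standard input into scp commands that can be pasted on a local machine to fetch those files from the cluster. Below is a sketch of the same idea as a function, with user, host and directory passed in explicitly so the behaviour is reproducible anywhere. The function name and the example user, path and file are hypothetical, and the sketch assumes the generated command should end in ` .` (copy into the current local directory).

```shell
#!/bin/sh
# Function form of the `c` alias. The alias takes user and directory from
# whoami and pwd on the cluster; here they are arguments instead.
# Note: breaks if the directory contains '#' or '&' (sed delimiter / match).
make_scp_cmd() {
    # $1 = user, $2 = host, $3 = remote directory; file names arrive on stdin
    sed "s#.*#scp $1@$2:$3/& .#"
}

# Hypothetical user, path and file name
printf 'results.pdf\n' | make_scp_cmd jdoe lisa.surfsara.nl /home/jdoe/scz_eur
# prints: scp jdoe@lisa.surfsara.nl:/home/jdoe/scz_eur/results.pdf .
```

On the cluster, something like `ls *.pdf | c` would give one such line per file.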