Ricopili: Introduction. WCPG Education Day Stephan Ripke / Raymond Walters Toronto, October 2015


1 Ricopili: Introduction WCPG Education Day Stephan Ripke / Raymond Walters Toronto, October 2015


2 What will we offer?
Practical: Sorry, no practical sessions today; please refer to the summer school organized by the PGC.
Theoretical:
- Ricopili pipeline, usage
- PGC data organization on LISA
- Non-pipeline GWAS analyses (polyscore, ldscore, pathways)
- Experiences from the past, working with hundreds of GWAS datasets from various sources
- General lectures on GWAS and beyond

3 Outline for this session
- Ricopili pipeline: history, application
- Pros / cons
- General workflow, structure
- Short presentation of the 4 core modules: Preimputation, Principal Component Analysis, Imputation, Postimputation

4 History Developed and built over the course of the last six years, just to make my life easier and to be able to handle large amounts of data from the Psychiatric Genomics Consortium. I am not a programmer. This was not a professional software project, and it is still evolving.

6 Released: March 2015

7 PGC SCZ (June 2014) cases, controls 97 genome wide significant sites

8 QQ plot

9 Lambda plot over Freq ex.
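The lambda in plots like this is conventionally the genomic inflation factor: the median observed 1-df chi-square statistic divided by its expected median under the null (about 0.4549). A minimal sketch of that convention, assuming two-sided p-values in (0, 1]; the function name is hypothetical and is not taken from the Ricopili code.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_lambda(p_values):
    """Genomic inflation factor from two-sided p-values in (0, 1]."""
    norm = NormalDist()
    # A two-sided p-value maps to z = Phi^-1(p / 2); z squared is the
    # corresponding chi-square statistic with 1 degree of freedom.
    chi2 = [norm.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN
```

Values near 1 indicate no inflation; values well above 1 suggest stratification, relatedness, or strong polygenic signal.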

10 Calcium channels (e.g. CACNA1C, chr. 12) are among the associated hits: CACNB2 (chr. 10), CACNA1I (chr. 22)

11 Increase in polygenic risk score prediction

12 Discoveries over sample size (plot: N hits (p < 5.0 * 10^-8) against N cases)
- Schizophrenia: ~4 / 1,000
- Crohn's: ~10 / 1,000
- Adult height: ~3 / 1,000
- (Bipolar disorder: ~3-4 / 1,000)
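Read as a rule of thumb, these rates give a rough expected number of hits for a given number of cases. The sketch below assumes the relationship is linear across the whole range, which is a simplification; the rates are the approximate figures from the slide, and the dictionary keys and function name are hypothetical.

```python
# Approximate hits per 1,000 cases, as (low, high) ranges from the slide.
HITS_PER_1000_CASES = {
    "schizophrenia": (4, 4),
    "crohns": (10, 10),
    "adult_height": (3, 3),
    "bipolar_disorder": (3, 4),
}


def expected_hits(trait, n_cases):
    """Return a (low, high) estimate of genome-wide significant hits.

    Assumes hits grow linearly with the number of cases, which is only
    a rough approximation.
    """
    low, high = HITS_PER_1000_CASES[trait]
    return (low * n_cases / 1000, high * n_cases / 1000)
```

For example, `expected_hits("bipolar_disorder", 10000)` returns a range of 30 to 40 hits under this assumption.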

13 Three major elements turned the tide
1) Genome resources
2) Technology
3) Collaboration
The pipeline is not a major contributor, but enables interaction between these three components

14 Pros
- Efficient use of a computer cluster, optimized for infrastructure with hundreds / thousands of nodes.
- Memory and walltime requirements of single jobs are minimal; high speed even for very big projects.
- Modular system, relatively easy to alter / expand (for me!).
- Standardized naming keeps an overview in big projects.
- Highly automated: you get final results within days with very little interaction (if the data is good!!).
- High-quality (publication-ready) output files in various flavors, actionable (R scripts provided).

15 Cons
- Scripts are not cleanly written: not easy for outsiders to understand / expand / find bugs.
- Highly interwoven with the cluster structure: not easy to port to a different system, almost impossible to run on a single machine.
- Breaks easily if not used as intended.
- High frustration level for new users.

16 Advice
- Use the Google Groups, ask early: orum/#!forum/rp-users
- If you can help, please help.
- Report back even if you found the problem and fixed it yourself; maybe we can integrate the fix into the pipeline for the next user.
- GitHub is up and running:

17 General structure of the pipeline

18 Ricopili 4 modules Interaction necessary Mostly standardized

19 Each module has a very different computational / user demand

20 Preimputation, three steps
1) Guess genotyping platform
2) Standardized file renaming, adding information to ID names (they become unique across datasets)
3) Technical QC with classic parameters, extended report with various plots (Manhattan, QQ, Lambda, Histograms)
Expected runtime with recent examples: 6000 individuals: 40 mins (highly dependent on how many datasets the data is spread over)
Needs interaction, so runtime is not as important
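The idea behind step 2 can be illustrated with a small sketch: prefixing every sample ID with its dataset name makes IDs unique even when two datasets reuse the same raw ID. The `name*id` scheme and the function name below are hypothetical, chosen only for illustration; they are not claimed to be the pipeline's actual naming convention.

```python
def make_unique_ids(datasets):
    """Prefix each sample ID with its dataset name.

    datasets: dict mapping a dataset name to a list of sample IDs.
    Returns a dict with the same keys and the renamed IDs.
    The "*" separator is an assumption made for this example.
    """
    renamed = {}
    for name, ids in datasets.items():
        renamed[name] = [f"{name}*{sample_id}" for sample_id in ids]
    return renamed
```

With this scheme, a sample called `S1` in two different datasets ends up with two different IDs, so datasets can be merged without collisions.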

21 Principal Components Analysis, 5 steps
1) Pruning, filtering
2) Overlap / relatedness testing
3) PCA
4) Association of PCs with genotypes
5) Plots in various flavors
Runtime: 600 individuals: 5 minutes; 40,000 individuals: 1 hour
Used for two reasons:
1) Genomic QC, excluding ancestral outliers
2) Creating covariates for final analysis
Deduping over all datasets in final analysis

22 Imputation, 11 steps
1) Guess genome build
2) Align positions, SNP names
3) Align alleles
4) Cut into genomic chunks
5) Prephasing
6) Imputation
7) Data reformatting
8) Postimputation QC / best guess
9) Genome-wide best guess
10) Clean
11) Evaluate hard disk usage (~40 MB / ID, 40 GB / 1000 IDs)
Runtime: 1000 individuals: 4 hours; 15,000 individuals: 48 hours
Steps 5, 6, 7 take > 90% of the computer resources of this module
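The disk usage figure from this slide can be turned into a quick planning estimate. This is a minimal sketch that assumes usage scales linearly with the number of individuals at roughly 40 MB per ID; the function name is hypothetical.

```python
MB_PER_ID = 40  # approximate figure quoted on the slide


def estimated_disk_gb(n_individuals, mb_per_id=MB_PER_ID):
    """Rough disk usage in GB for an imputation run.

    Uses decimal units (1 GB = 1000 MB), which matches the slide's
    "40 GB / 1000 IDs".
    """
    return n_individuals * mb_per_id / 1000
```

For example, `estimated_disk_gb(15000)` gives 600 GB, so larger projects should budget disk space before starting.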

23 Postimputation, 12 steps
1) Association analysis
2) Meta-analysis
3) Collection of results
4) Create a separate set for heterogeneity P
5) Count SNPs for various thresholds, create top lists
6) Clumping
7) Region plots
8) Forest plots
9) Manhattan plots
10) QQ plots
11) LD score
12) Lambda plots
Runtime: 1000 individuals: 30 mins; 40,000 individuals: 4 hours
Step 1 takes the majority of the computer resources of this module; it is possible to start from step 2
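To make step 2 concrete, here is a sketch of a standard fixed-effect, inverse-variance-weighted meta-analysis for a single SNP across datasets. It illustrates the general technique only; the slides do not say which method or tool the pipeline uses, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def fixed_effect_meta(betas, ses):
    """Combine per-dataset effect sizes for one SNP.

    betas: effect estimates (e.g. log odds ratios), one per dataset.
    ses:   their standard errors, in the same order.
    Returns (beta, se, z, p) for the pooled estimate.
    """
    # Each dataset is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    # Two-sided p-value from the standard normal distribution.
    p = 2.0 * NormalDist().cdf(-abs(z))
    return beta, se, z, p
```

Two datasets with the same effect and standard error yield the same pooled effect with a standard error smaller by a factor of the square root of 2.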

24 Working directories
QC:
- As many pre-QC datasets as you have
- Different directory for different phenotype / ancestry
- One pipeline run per directory
Imputation:
- Subdirectory of QC
- One pipeline run per directory
- As many post-QC datasets as you have
- Different directory for different ancestry
- Easy merging of imputation directories afterwards

25 Working directories
Postimputation:
- Within the imputation subdirectory (it's possible to merge imputation directories)
- Many parallel pipeline runs possible (different --addout)
- Working directory and result directory get distinct names
PCA:
- As many pre-/post-QC / imputation datasets as you wish
- Many parallel pipeline runs possible per directory
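The nesting described on these two slides can be sketched as follows. Every directory name here is hypothetical and chosen only to show the relationships (imputation under QC, postimputation under imputation); the real pipeline's names will differ, and placing the PCA directory next to the QC directory is an assumption.

```python
from pathlib import Path


def make_layout(root, study="scz_eur"):
    """Create an illustrative working-directory tree under root.

    All names are made up for this example.
    """
    qc = Path(root) / f"{study}_qc"        # one QC run per phenotype / ancestry
    imputation = qc / "imputation"         # subdirectory of the QC directory
    postimp = imputation / "postimp_run1"  # one of possibly many parallel runs
    pca = Path(root) / f"{study}_pca"      # PCA directory (placement assumed)
    for directory in (qc, imputation, postimp, pca):
        directory.mkdir(parents=True, exist_ok=True)
    return {"qc": qc, "imputation": imputation, "postimp": postimp, "pca": pca}
```

Calling `make_layout("/tmp/demo")` creates the tree and returns the four paths, which makes the parent-child relationships easy to inspect.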

26 Directory structure, visual representation

27 Several non-pipeline analyses we probably won't have time to cover here
- Replication
- Polygenic scoring
- Family (trio) imputation
- Chromosome X imputation
- Inclusion of results at Ricopili website:

28 Installation of the pipeline, installing some useful aliases
Short introduction:
alias c='sed "s#.*#scp
alias ls='ls --color=auto'
alias cl='column -t
Get yourself familiar with emacs or vim
Pubkey (id_rsa.pub), not covered here
