Ricopili: Introduction WCPG Education Day Stephan Ripke / Raymond Walters Toronto, October 2015
What will we offer? Practical: sorry, no practical sessions today; please refer to the summer school organized by the PGC. Theoretical: Ricopili pipeline, usage; PGC data organization on LISA; non-pipeline GWAS analyses (polygenic score, LD score, pathways); experiences from the past, working with hundreds of GWAS datasets from various sources; general lectures on GWAS and beyond
Outline for this session Ricopili pipeline History, Application Pros / Cons General workflow, structure Short presentation of 4 core modules Preimputation Principal Component Analysis Imputation Postimputation
History Developed and built over the course of the last six years, just to make my life easier: to be able to handle large amounts of data from the Psychiatric Genomics Consortium. I am not a programmer. This was not a professional software project, and it is still evolving.
http://www.nimh.nih.gov/about/strategic-planning-reports/highlights/highlight-skyline-drivers.shtml released: March 2015
PGC SCZ (June 2014): 35,476 cases, 46,839 controls, 97 genome-wide significant sites
QQ plot
Lambda plot over frequency (example)
Calcium channels (e.g. CACNA1C, chr. 12) are among the associated hits: CACNB2 (chr. 10), CACNA1I (chr. 22)
Increase in polygenic risk score prediction
Discoveries over sample size: number of hits (p < 5.0 × 10⁻⁸) gained per 1,000 additional cases. Schizophrenia: ~4 / 1,000; Crohn's: ~10 / 1,000; adult height: ~3 / 1,000; (bipolar disorder: ~3-4 / 1,000). Plotted against N cases.
Three major elements turned the tide 1) Genome Resources 2) Technology 3) Collaboration The pipeline is not a major contributor, but enables interaction between these three components
Pros Efficient use of the computer cluster, optimized for infrastructure with hundreds / thousands of nodes; memory and walltime requirements of single jobs are minimal, giving high speed even for very big projects. Modular system, relatively easy to alter / expand (for me!). Standardized naming keeps an overview in big projects. Highly automated: you get final results within days with very little interaction (if the data is good!!). High-quality (publication-ready) output files in various flavors, actionable (R scripts provided).
Cons Scripts are not cleanly written, so it is not easy for outsiders to understand, expand, or find bugs. Highly interwoven with the cluster structure, so not easy to port to a different system and almost impossible to run on a single machine. Breaks easily if not used as intended; high frustration level for new users.
Advice Use the Google group, ask early: https://groups.google.com/a/broadinstitute.org/forum/#!forum/rp-users If you can help, please help. Report back even if you found the problem and fixed it yourself; maybe we can integrate the fix into the pipeline for the next user. Github is up and running: https://github.com/nealelab/ricopili/wiki
General structure of the pipeline
Ricopili: 4 modules. Interaction necessary; mostly standardized.
Each module has a very different computational / user demand
Preimputation, three steps 1) Guess genotyping platform 2) Standardized file renaming, adding information to ID names (they become unique across datasets) 3) Technical QC with classic parameters; extended report with various plots (Manhattan, QQ, lambda, histograms). Expected runtime with recent examples: 6,000 individuals: 40 mins (highly dependent on how many datasets the data is spread over). Needs interaction, so runtime is not as important.
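The "classic parameters" in step 3 can be sketched as a single PLINK call. This is a hedged illustration, not Ricopili's exact defaults: the dataset name is hypothetical, and the thresholds (per-individual and per-SNP missingness 0.02, MAF 0.01, HWE p 1e-6) are common textbook choices. The command is only printed (dry run), not executed, since PLINK may not be installed.

```shell
# Sketch of step 3 (technical QC) as one PLINK invocation.
# BFILE and all threshold values are illustrative assumptions.
BFILE=mystudy
QC_CMD="plink --bfile $BFILE \
  --mind 0.02 \
  --geno 0.02 \
  --maf 0.01 \
  --hwe 1e-6 \
  --make-bed --out ${BFILE}_qc"
# Dry run: print the command instead of executing it
echo "$QC_CMD"
```

Running the printed command would drop individuals and SNPs failing the thresholds and write a new binary fileset `mystudy_qc`.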
Principal Component Analysis, 5 steps 1) Pruning, filtering 2) Overlap / relatedness testing 3) PCA 4) Association of PCs with genotypes 5) Plots in various flavors. 600 individuals: 5 minutes; 40,000 individuals: 1 hour. Used for two reasons: 1) genomic QC, excluding ancestral outliers 2) creating covariates for the final analysis. Deduping over all datasets in the final analysis.
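Steps 1 and 3 can be sketched with two PLINK calls; the parameter values (pruning window 50, step 5, r² 0.2, 10 PCs) are common choices, not necessarily Ricopili's settings, and the dataset name is hypothetical. As above, the commands are printed rather than executed.

```shell
# Sketch of pruning (step 1) and PCA (step 3); all names/values are assumptions.
BFILE=mystudy_qc
# LD pruning down to a near-independent SNP set
PRUNE_CMD="plink --bfile $BFILE --indep-pairwise 50 5 0.2 --out ${BFILE}_prune"
# PCA on the pruned SNPs; plink --pca writes .eigenvec/.eigenval files
PCA_CMD="plink --bfile $BFILE --extract ${BFILE}_prune.prune.in --pca 10 --out ${BFILE}_pca"
echo "$PRUNE_CMD"
echo "$PCA_CMD"
```

The resulting `.eigenvec` columns are what end up as ancestry covariates in the final association analysis.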
Imputation, 11 steps 1) Guess genome build 2) Align positions, SNP names 3) Align alleles 4) Cut into genomic chunks 5) Prephasing 6) Imputation 7) Data reformatting 8) Postimputation QC / best guess 9) Genome-wide best guess 10) Clean 11) Evaluate hard-disk usage (~40 MB / ID, 40 GB / 1,000 IDs). Runtimes: 1,000 individuals: 4 hours; 15,000 individuals: 48 hours. Steps 5, 6, 7 take > 90% of the computer resources of this module.
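Two numbers on this slide can be made concrete in a few lines of shell: the chunking of step 4 (the 3 Mb chunk size and 15 Mb toy region below are assumptions for illustration) and the hard-disk estimate in step 11, using the ~40 MB per individual figure quoted above.

```shell
# Illustrative chunking (step 4): cut a 15 Mb toy region into 3 Mb chunks.
CHUNK_MB=3
REGION_MB=15
START=1
while [ "$START" -le "$REGION_MB" ]; do
  END=$((START + CHUNK_MB - 1))
  [ "$END" -gt "$REGION_MB" ] && END=$REGION_MB   # clip last chunk at region end
  echo "chunk: ${START}Mb-${END}Mb"
  START=$((END + 1))
done
# Hard-disk estimate (step 11) at ~40 MB per individual:
N_IDS=15000
echo "estimated disk: $((N_IDS * 40 / 1000)) GB"   # 15,000 IDs -> 600 GB
```

At 15,000 individuals this reproduces the slide's rule of thumb (40 GB per 1,000 IDs), which is worth checking before launching an imputation run on a quota-limited filesystem.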
Postimputation, 12 steps 1) Association analysis 2) Meta-analysis 3) Collection of results 4) Create a separate set for heterogeneity P 5) Count SNPs for various thresholds, create top lists 6) Clumping 7) Region plots 8) Forest plots 9) Manhattan plots 10) QQ plots 11) LD score 12) Lambda plots. 1,000 individuals: 30 mins; 40,000 individuals: 4 hours. Step 1 takes the majority of the computer resources of this module; it is possible to start from step 2.
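The lambda plots of step 12 rest on the genomic inflation factor lambda_GC = median(chi-square) / 0.4549, where 0.4549 is the median of a 1-df chi-square distribution. A minimal sketch of that calculation, assuming you already have a column of 1-df chi-square statistics (the five values below are toy data, chosen so the median is exactly 0.4549):

```shell
# Compute lambda_GC from a column of 1-df chi-square statistics (toy input).
LAMBDA=$(printf '%s\n' 0.1 0.3 0.4549 0.9 2.5 | sort -n | awk '
  { a[NR] = $1 }
  END {
    if (NR % 2) m = a[(NR + 1) / 2]          # odd count: middle value
    else        m = (a[NR/2] + a[NR/2 + 1]) / 2
    printf "%.4f", m / 0.4549                # genomic inflation factor
  }')
echo "lambda = $LAMBDA"   # toy median is 0.4549, so this prints lambda = 1.0000
```

Values well above 1 indicate inflation from stratification or artifacts; plotting lambda per frequency bin is what the "lambda plot over frequency" slide earlier shows.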
Working directories QC: as many pre-QC datasets as you have; different directory for each phenotype / ancestry; one pipeline run per directory. Imputation: subdirectory of QC; one pipeline run per directory; as many post-QC datasets as you have; different directory for each ancestry; easy merging of imputation directories afterwards.
Working directories Postimputation: within the imputation subdirectory (it's possible to merge imputation directories); many parallel pipeline runs possible (different --addout); working directory and result directory get distinct names. PCA: as many pre-QC / post-QC / imputation datasets as you wish; many pipeline runs possible in parallel per directory.
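The nesting rules above can be sketched with plain mkdir calls. All directory names here are hypothetical; only the structure (imputation as a subdirectory of the QC directory, one directory per phenotype/ancestry, separate PCA directories) follows the slides.

```shell
# Hypothetical layout following the working-directory rules above.
BASE=$(mktemp -d)                            # sandbox root for the sketch
mkdir -p "$BASE/qc_scz_eur/imputation"       # QC dir per phenotype/ancestry; imputation nested inside
mkdir -p "$BASE/pca_scz_eur"                 # separate PCA directory
# Postimputation runs live inside the imputation subdirectory,
# distinguished by different --addout names rather than directories.
ls "$BASE"
```

One pipeline run per QC or imputation directory; only postimputation (via --addout) and PCA support several runs side by side.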
Directory structure, visual representation
Several non-pipeline analyses we probably won't have time to cover here: replication; polygenic scoring; family (trio) imputation; chromosome X imputation; inclusion of results at the Ricopili website: http://www.broadinstitute.org/mpg/ricopili/
Installation of the pipeline, installing some useful aliases: https://sites.google.com/a/broadinstitute.org/ricopili Short introduction:
alias c='sed "s#.*#scp $(whoami)@lisa.surfsara.nl:$(pwd)/& .#"'
alias ls='ls --color=auto'
alias cl='column -t'
Get yourself familiar with emacs or vim. Pubkey (id_rsa.pub), not covered here.