Emile R. Chimusa Division of Human Genetics Department of Pathology University of Cape Town


Advanced Genomic Data Manipulation and Quality Control with plink
Emile R. Chimusa (emile.chimusa@uct.ac.za)
Division of Human Genetics, Department of Pathology, University of Cape Town

Outline:
1. Introduction to Cluster Server
2. Introduction to plink
3. Genomics Data Quality Control

Introduction to Cluster Server
Opening a terminal to connect to a Linux system or PBS server:
1. Mac OS X includes a Terminal application (located in the Applications >> Utilities folder), which can be used to connect to other systems.
2. On Ubuntu, launch Terminal (Ctrl + Alt + T) to get a command prompt. Use the dashboard to search for a particular program.
3. On Windows systems you can use a variety of programs to connect to a Linux system. PuTTY is free and the most widely used.
By default the terminal opens in your home folder. When you connect remotely to the Linux cluster server, you likewise start in your home directory (folder).


Introduction to Cluster Server
Proxy server: a dedicated computer acting as an intermediary between an endpoint device, such as a computer, and another server from which a user or client is requesting a service.
An address has the form Username@Hostname.Domain (or proxy address). Examples:
echimusa@lengau.chpc.ac.za
echimusa@scp.chpc.ac.za
echimusa@gmail.com
How to connect to the PBS server:
>$ ssh Username@proxy_address
Example:
>$ ssh echimusa@lengau.chpc.ac.za
>$ ssh -X echimusa@lengau.chpc.ac.za
(-X enables X11 forwarding, so that graphical programs can be displayed locally.)

Introduction to Cluster Server
When you sign in you are placed in your home directory. To see where this directory is located in the file system, use the pwd command:
[echimusa@login2 ~]$ pwd
/home/echimusa
To see what is inside this directory, use the ls command (ls stands for list):
[echimusa@login2 ~]$ ls
get-pip.py hapfuse MarViN1 soft supportmix vcftools
To change to a different directory, use the cd command (cd means change directory):
[echimusa@login2 ~]$ cd /mnt/lustre/users/echimusa/
You can supply certain alias terms to the cd command. One of these is the ~ character, which represents your home directory (/home/echimusa/). Another is .., which represents the directory above the current directory.

Introduction to Cluster Server
To create your own directories, use the mkdir (make directory) command:
[echimusa@login2 ~]$ mkdir proteins
[echimusa@login2 ~]$ cd proteins/
[echimusa@login2 proteins]$ ls
[echimusa@login2 proteins]$
[echimusa@login2 proteins]$ pwd
/home/echimusa/proteins
To create a new file, use touch or the nano editor. (To see who else is signed in to the same system, use the who command.)
[echimusa@login2 proteins]$ touch my_sequence.sh
[echimusa@login2 proteins]$ ls -l my_sequence.sh
-rw-rw-r-- 1 echimusa echimusa 0 May 6 22:49 my_sequence.sh

Getting Started: Basic commands
To edit the new file, open it in nano:
[echimusa@login2 proteins]$ nano my_sequence.sh

Introduction to Cluster Server
CHPC uses the GNU modules utility, which manipulates your environment, to provide access to the supported software in /apps/.
For a list of available modules:
[echimusa@login2 ~]$ module avail
To see currently loaded modules:
[echimusa@login2 proteins]$ module list
To remove all loaded modules:
[echimusa@login2 proteins]$ module purge
To load a module:
[echimusa@login2 proteins]$ module load name_module

Introduction to Cluster Server
Example job script, my_sequence.sh:
#!/bin/bash
#PBS -N Xchr
#PBS -q smp
#PBS -P CBBI0818
#PBS -l select=1:ncpus=24
#PBS -l walltime=48:00:00
#PBS -M echimusa@gmail.com
module load chpc/biomodules
module load chpc/python/2.7.11
Scheduler commands:
qstat: view queued jobs (e.g. qstat -u user_name), or see what is on each queue (qstat -Q).
qsub: submit a job to the scheduler.
qdel: delete one of your jobs from the queue (qdel ID_of_your_job).
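As a minimal sketch of how the pieces above fit together, the snippet below writes a complete job script to a temporary location and checks its syntax before submission. The #PBS header lines are the ones from the slide; the temporary path, the cd "$PBS_O_WORKDIR" step and the placeholder echo command are assumptions added for illustration, not part of the original script.

```shell
# Hypothetical location for the job script; adjust to your own project folder.
job_script="${TMPDIR:-/tmp}/my_sequence.sh"

# Quoted 'EOF' keeps $variables unexpanded so they are evaluated when the job runs.
cat > "$job_script" <<'EOF'
#!/bin/bash
#PBS -N Xchr
#PBS -q smp
#PBS -P CBBI0818
#PBS -l select=1:ncpus=24
#PBS -l walltime=48:00:00
#PBS -M echimusa@gmail.com

module load chpc/biomodules
module load chpc/python/2.7.11

# Assumption: run from the directory the job was submitted from.
cd "$PBS_O_WORKDIR"

# Placeholder for the real analysis command.
echo "Job started on $(hostname)"
EOF

# Parse the script without executing it, to catch syntax errors early.
sh -n "$job_script" && echo "syntax OK: $job_script"

# On the cluster, submit and monitor with:
#   qsub "$job_script"
#   qstat -u "$USER"
```

Checking the script locally first is cheap; a typo found only after the job has waited in the queue costs far more time.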

Introduction to Cluster Server: Transferring files
From both Mac and Ubuntu, we use the terminal to transfer data from the local to the remote computer, or from the remote to the local machine. We commonly use rsync, scp and wget (or curl).
a) Syntax: rsync options source destination
   -au: archive mode, updating only files that are newer in the source directory
b) Syntax: scp options source destination
   -r: needed when copying a folder
c) Syntax: wget options source -O destination_file
   For an ftp or internet source such as http://www.whatever.com/filename.txt
   -nc => --no-clobber (do not download again a file that already exists)
   -N => turn on time-stamping
   -r => turn on recursive retrieving
   -O destination_file (or -P destination_folder) is optional when copying into the current folder
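The three tools can be tried safely with a small dry-run wrapper, sketched below. The run helper, the Tutorial and results folder names, and the choice of scp.chpc.ac.za as transfer host are illustrative assumptions; only the commands and options themselves come from the slide. With DRY_RUN=1 the commands are printed rather than executed.

```shell
# Print commands instead of running them while DRY_RUN=1 (hypothetical helper).
DRY_RUN=1
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

remote="echimusa@scp.chpc.ac.za"          # user@host, as in the examples above
remote_dir="/mnt/lustre/users/echimusa"   # working directory on the cluster

# Local -> remote: copy only new or updated files of a folder.
run rsync -au ./Tutorial/ "$remote:$remote_dir/Tutorial/"

# Remote -> local: copy a whole results folder.
run scp -r "$remote:$remote_dir/results" ./results

# Internet -> current folder, skipping files that already exist.
run wget -nc http://www.whatever.com/filename.txt
```

Set DRY_RUN=0 once the printed commands look right.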

Introduction to Cluster Server: Transferring files
Transferring data from Windows: we can use the WinSCP software. Launch the program and enter the appropriate information into the Host name, User name, and Password text areas. Click Login to connect to the remote system. Once you are connected you can transfer files and directories between systems by dragging them across the simple graphical interface.

Introduction to Cluster Server: Transferring files
[Figure: WinSCP window showing two "Explore folder" panels, one of them labelled "Local machine".]
Once you are connected, transfer files or folders between systems by dragging them from one panel to the other.


Connecting to CHPC and Downloading the Tutorial data
1. Connect to CHPC:
(a) Windows users: open PuTTY and use the CHPC login details you were given.
(b) Linux or Mac users: open the terminal and type
> ssh YOUR_USERNAME@lengau.chpc.ac.za (and type your password)
2. Once connected, change directory as follows:
> cd /mnt/lustre/users/your_username/ (press enter)
3. Download the Tutorial from http://www.cbio.uct.ac.za/emile-chimusa/gwas_2017/tutorial.tar.gz and unpack it:
> wget http://www.cbio.uct.ac.za/emile-chimusa/gwas_2017/tutorial.tar.gz
> tar xvf tutorial.tar.gz
> cd Tutorial
> ls
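File names on Linux are case-sensitive, so the name passed to tar must match the name wget saved. A hedged sketch of the download step, deriving the archive name from the URL instead of retyping it, could look like this; the fetch_tutorial function is a hypothetical helper and is left uncalled here so that nothing is downloaded by accident.

```shell
url="http://www.cbio.uct.ac.za/emile-chimusa/gwas_2017/tutorial.tar.gz"

# wget saves the file under the last component of the URL.
archive="$(basename "$url")"

fetch_tutorial() {
    # -nc: skip the download if the archive is already present.
    wget -nc "$url" || return 1
    # List the first entries to see the name of the top-level folder.
    tar -tzf "$archive" | head -n 5
    # Unpack the archive into the current directory.
    tar -xzf "$archive"
}

echo "archive name: $archive"
# fetch_tutorial    # uncomment on the cluster, inside /mnt/lustre/users/your_username/
```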

Tutorial data and scripts to run jobs at CHPC
Inside the Tutorial folder:
A. SHELL: folder containing some Linux scripts to be used on the HPC.
1. For PCA: run_pca.sh. This script uses the data prepared in step 1 and calls two Python scripts to run smartpca and conduct the PCA. Again, the #PBS lines at the top of the file specify the allocation on the server, followed by the working, data and software directory variables.
2. Admixture (population structure): qsub_admixture.sh and runcontinent2.sh. This is a clustering method that needs you to pre-specify the number of possible clusters in your data. We will run it only for K = 2, 3, 4 (see qsub_admixture.sh, which submits runcontinent2.sh to the server to run the admixture software).
B. Genesis_tutorial: this folder contains the Genesis software and the basic data that I demonstrated in the last class. Once you have your results from both PCA and admixture, you will use Genesis for plotting.

Tutorial data and scripts to run jobs at CHPC
Inside the Tutorial folder:
C. population_structure_data: folder containing the Africa data (Africa55K_10Pops.fam, .bed, .bim). Remember that our target populations are HAZDA and SADAWE (Tanzania); we will investigate their population structure against the other populations in the whole dataset.
D. software: contains all your software, except smartpca.
E. GWAS_data: contains the GWAS data (GWAS.ped, .pedind, .map, for ~100 cases and 874 controls). This folder also has the run_gwas.sh script, which contains the commands to run the GWAS (pre-GWAS QC, association test and some adjustment), an R script to draw the Q-Q plot (qqplot.r), and Mahanatha.py (to draw the Manhattan plot). How to run them is shown in run_gwas.sh.

Introduction to plink
Get plink running:
1. Download and install PLINK: https://www.cog-genomics.org/plink2
2. Windows users: unzip the downloaded file. Copy the application file plink.exe into a folder called "Plink" (or whatever name you prefer) anywhere on your computer (creating the folder directly on the C: drive is convenient).
3. Click Start > Run (or Start > Search Programs and Files), type "cmd" and hit Enter to open a command prompt.
4. In the command prompt, go to the Plink folder where you pasted plink.exe. If it is at C:\plink, type cd C:\plink
5. To go back to the parent directory, type cd .. until you reach the C: drive.

Introduction to plink
Popular Genomics data format: encoded data

Sample   SNP1 (T/A)  SNP2 (G/C)  SNP3 (G/A)  =>  SNP1 (A/T)  SNP2 (C/G)  SNP3 (A/G)
emile    AA          CC          AA              2           2           2
Annie    AT          CG          AA              1           1           2
Gaston   AA          CG          AA              0           1           2
Jacqui   TT          GG          GG              0           0           0
Ephie    TT          CC          GG              0           2           0
Imani    TA          CG          GG              1           1           0

Genotypes are coded based on the count of the minor allele.
Annotation: a good ranking strategy would produce SNP3, SNP1, SNP2.
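The additive coding can be reproduced by counting how many copies of one allele each genotype carries. The sketch below is an illustration only, not how PLINK is implemented: count_allele and encode_sample are hypothetical helper names, and the counted alleles (A, C and A) are taken from the first allele in each encoded column header.

```shell
# Count the copies of one allele in a genotype string such as "AT".
count_allele() {
    # $1 = genotype, $2 = allele to count.
    # tr -cd keeps only the requested allele; wc -c counts what is left.
    echo $(( $(printf '%s' "$1" | tr -cd "$2" | wc -c) ))
}

# Encode one sample with three SNPs, counting A, C and A respectively.
encode_sample() {
    echo "$(count_allele "$1" A) $(count_allele "$2" C) $(count_allele "$3" A)"
}

encode_sample AA CC AA   # emile  -> 2 2 2
encode_sample AT CG AA   # Annie  -> 1 1 2
encode_sample TT GG GG   # Jacqui -> 0 0 0
```

A heterozygote always scores 1 and the two homozygotes score 0 and 2, which is what the RAW output format stores.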

Introduction to plink
Popular Genomics data format
1. Standard format: map and ped files (the ped file is very wide if there are many more SNPs than individuals, as SNPs go in columns).
2. Binary format: bed, bim, fam files (compact files, about 1/10th the size of the original map/ped files).
3. VCF (.gz) file.
4. Oxford format: gen (.gz).

Introduction to plink
Popular Genomics data format

Format                                  Input option   Output option
PED/MAP                                 --file         --recode --out
BED/BIM/FAM                             --bfile        --make-bed --out
TPED/TFAM                               --tfile        --recode --transpose
RAW (coded on count of minor allele)    None           --recodeA
LGEN/MAP/FAM                            --lfile        --recode-lgen
VCF (.gz)                               --vcf          --recode vcf

Note that for the PED format, alleles can be encoded as ACGT or 1234. The --alleleACGT and --allele1234 options (and --recode 12) can be used to do the conversion; you have to use --recode or --make-bed too:
plink --file filename --make-bed --options
More detail at https://www.cog-genomics.org/plink/1.9/input

Introduction to plink
Popular Genomics data format: Examples
1. Convert data from bed, bim, fam files to a VCF file:
plink --bfile example --recode vcf --out example
and then the VCF back to bed, bim, fam:
plink --vcf example.vcf --double-id --vcf-require-gt --biallelic-only strict --missing-genotype 0 --allow-extra-chr --recode --make-bed --out example2
2. Convert bed, bim, fam to tped:
plink --bfile example2 --recode --transpose --out tpexample
3. Convert ped/map to bed, bim, fam, and read a tped file set back in:
plink --file example --recode --make-bed --out example3
plink --tfile tpexample --recode --transpose --out example

Introduction to plink
Popular Genomics data format: Slicing, dicing, ...
Add the parameters below to the previous commands.
Filters:
--chr (extract the data of a specific chromosome)
--maf (keep SNPs whose minor allele frequency is at least the specified value)
--mind (remove samples whose proportion of missing genotypes exceeds the specified value)
--geno (remove SNPs whose proportion of missing genotypes exceeds the specified value)
--hwe (remove SNPs that deviate from HWE, i.e. with a test p-value below the specified value)
Get a subset of SNPs:
--snps (extract a SNP or a range of SNPs)
--extract (keep only the SNPs listed in a file)
--exclude (remove the SNPs listed in a file)
Get a subset of samples:
--keep sample.txt
--remove sample.txt
Example:
plink --bfile example --chr 22 --recode --out example.22
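Several of these filters can be combined in one run. The sketch below builds the argument list one filter at a time so that each can be switched on or off easily; the output prefix example.chr22.filtered and the 0.05 threshold are illustrative assumptions. The command is only executed if plink and the input files are actually present; otherwise it is printed.

```shell
# Build the option string step by step (all values are examples).
plink_args="--bfile example"                      # input: example.bed/.bim/.fam
plink_args="$plink_args --chr 22"                 # keep chromosome 22 only
plink_args="$plink_args --maf 0.05"               # keep SNPs with MAF >= 0.05
plink_args="$plink_args --keep sample.txt"        # keep the listed samples only
plink_args="$plink_args --make-bed --out example.chr22.filtered"

if command -v plink >/dev/null 2>&1 && [ -f example.bed ] && [ -f sample.txt ]; then
    # Unquoted on purpose: the string must split into separate arguments.
    plink $plink_args
else
    echo "plink $plink_args"
fi
```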

Introduction to plink
Popular Genomics data format: Slicing, dicing, ...
1. Subsetting the data to chromosome 22:
> plink --bfile example --recode --chr 22 --out hap.chr22
2. Subsetting the data to males only:
> plink --bfile example --recode --filter-males --out hap_males
3. Subsetting the data to females only:
> plink --bfile example --recode --filter-females --out hap_females
4. Subsetting the data to cases only:
> plink --bfile example --recode --filter-cases --out hap_cases
5. Subsetting the data to controls only:
> plink --bfile example --recode --filter-controls --out hap_controls

Introduction to plink
Popular Genomics data format: Exercise
1. Use the example data (example.bed, example.bim, example.fam) to convert to a VCF file, retaining only the data of chromosome 10 in the output.
2. Use the example data (example.bed, example.bim, example.fam) to convert to a tped file, and
(a) retain only the samples in subsample_extract.txt;
(b) exclude the samples in subsample.txt.
Write (a) and (b) into different files, with genotypes coded as 1234 for (a) and 12 for (b).
3. Use the example data (example.bed, example.bim, example.fam) to
(a) extract the data from rs5770916 to rs9616985 and write the output in ped/map format;
(b) extract the SNPs in file SNP_extract.txt and write the output in bed/bim/fam format;
(c) exclude the SNPs in file SNP_exclude.tx and write the output as VCF;
(d) write into VCF only the common SNPs (MAF threshold 0.05) of chromosome 1.

Genomics Data Quality Control
Removing bad SNPs and individuals:
1. First, remove any individuals who have less than, say, 95% genotype data, i.e. more than 5% missing genotypes (--mind 0.05).
2. Then remove SNPs that have a minor allele frequency below, say, 1% (--maf 0.01).
3. Then remove SNPs that have a genotype call rate below, say, 90%, i.e. more than 10% missing genotypes (--geno 0.1).
4. Finally, remove SNPs whose test for deviation from HWE gives a p-value < 0.05 (--hwe 0.05).
All four filters in one command:
> plink --bfile example --make-bed --mind 0.05 --maf 0.01 --geno 0.1 --hwe 0.05 --out Dclean
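To make the two missingness filters concrete, the sketch below computes the quantities they are compared against: the missing rate per sample (the --mind criterion) and the missing rate per SNP (the --geno criterion). The four-sample, three-SNP genotype table is made-up toy data with 00 marking a missing call, and missing_report is a hypothetical helper; this illustrates the arithmetic only and is not how PLINK itself is implemented.

```shell
# Toy data (hypothetical): sample ID followed by one genotype per SNP; 00 = missing.
genotypes='s1 AA CC AG
s2 AT 00 AG
s3 00 00 GG
s4 TT CG 00'

missing_report() {
    awk '
    {
        n_samples++
        sample_missing = 0
        for (i = 2; i <= NF; i++) {
            if ($i == "00") { sample_missing++; snp_missing[i]++ }
        }
        # Per-sample missing rate: the value --mind is compared against.
        printf "sample %s missing %.2f\n", $1, sample_missing / (NF - 1)
        n_fields = NF
    }
    END {
        # Per-SNP missing rate: the value --geno is compared against.
        for (i = 2; i <= n_fields; i++)
            printf "snp%d missing %.2f\n", i - 1, snp_missing[i] / n_samples
    }'
}

printf '%s\n' "$genotypes" | missing_report
```

With --mind 0.05 and --geno 0.1, every sample and SNP in this tiny table except s1 would be removed; real data sets have thousands of SNPs, so the rates are far less coarse.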

Work is done, relax on beach?