Advanced Genomic Data Manipulation and Quality Control with plink
Emile R. Chimusa (emile.chimusa@uct.ac.za)
Division of Human Genetics, Department of Pathology, University of Cape Town
Outline:
1. Introduction to Cluster Server
2. Introduction to plink
3. Genomics Data Quality Control
Introduction to Cluster Server
Opening a terminal to connect to a Linux system or PBS server:
1. Mac OS X includes a Terminal application (located in the Applications >> Utilities folder), which can be used to connect to other systems.
2. On Ubuntu, launch Terminal (Ctrl + Alt + T) to get a command prompt. Use the dashboard to search for a particular piece of software.
3. On Windows systems you can use a variety of programs to connect to a Linux system. PuTTY is free and the most widely used.
By default the terminal opens in your home folder. When you connect remotely to a Linux cluster server, you also start in your home directory (folder).
Introduction to Cluster Server
Proxy server: a dedicated computer acting as an intermediary between an endpoint device, such as a computer, and another server from which a user or client is requesting a service.
An address has the form Username@Hostname.Domain (or proxy address). Examples:
  echimusa@lengau.chpc.ac.za
  echimusa@scp.chpc.ac.za
  echimusa@gmail.com
How to connect to the PBS server:
  >$ ssh Username@proxy_address
Examples:
  >$ ssh echimusa@lengau.chpc.ac.za
  >$ ssh -X echimusa@lengau.chpc.ac.za    (-X enables X11 forwarding for graphical programs)
Introduction to Cluster Server
When you sign in you will be located in your home directory. To see where this directory is located in the file system, use the pwd command. For example:
  [echimusa@login2 ~]$ pwd
  /home/echimusa
To see what is inside this directory, use the ls command (ls stands for list):
  [echimusa@login2 ~]$ ls
  get-pip.py hapfuse MarViN1 soft supportmix vcftools
To change to a different directory, use the cd command (cd means change directory):
  [echimusa@login2 ~]$ cd /mnt/lustre/users/echimusa/
You can supply certain alias terms to the cd command. One of these is the ~ character, which represents your home directory (/home/echimusa/). Another is .., which represents the directory above the current directory.
Introduction to Cluster Server
To create your own directories use the mkdir (make directory) command:
  [echimusa@login2 ~]$ mkdir proteins
  [echimusa@login2 ~]$ cd proteins/
  [echimusa@login2 proteins]$ ls
  [echimusa@login2 proteins]$
  [echimusa@login2 proteins]$ pwd
  /home/echimusa/proteins
To create a new file, use touch (or a text editor such as nano):
  [echimusa@login2 proteins]$ touch my_sequence.sh
  [echimusa@login2 proteins]$ ls -l my_sequence.sh
  -rw-rw-r-- 1 echimusa echimusa 0 May 6 22:49 my_sequence.sh
To see who else is signed in to the same system, use the who command.
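The navigation and file-creation commands above can be tried safely in a scratch directory. The following is a minimal sketch, not part of the tutorial data: the folder name demo is a placeholder, and mktemp -d is used so that nothing in a real home directory is touched.

```shell
# Work in a throwaway directory so nothing in your real home folder is touched.
scratch=$(mktemp -d)
cd "$scratch"

mkdir -p demo/proteins        # hypothetical folder names, for illustration only
cd demo/proteins
here=$(pwd)                   # pwd prints the absolute path of the current directory
echo "now in: $here"

touch my_sequence.sh          # create an empty file
ls -l my_sequence.sh          # long listing: permissions, owner, size, date

cd ..                         # '..' is the directory above the current one
parent=$(basename "$(pwd)")
echo "after cd ..: $parent"

cd ~                          # '~' is shorthand for your home directory
echo "home is: $(pwd)"
```

The same commands work unchanged on the cluster login node; only the paths differ.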
Getting Started: Basic commands
To edit the new file, open it in nano:
  [echimusa@login2 proteins]$ nano my_sequence.sh
Introduction to Cluster Server
CHPC uses the GNU modules utility, which manipulates your environment, to provide access to the supported software in /apps/.
For a list of available modules:
  [echimusa@login2 ~]$ module avail
To see currently loaded modules:
  [echimusa@login2 proteins]$ module list
To remove all modules:
  [echimusa@login2 proteins]$ module purge
To load a module:
  [echimusa@login2 proteins]$ module load name_module
Introduction to Cluster Server
Example job script, my_sequence.sh:
  #!/bin/bash
  #PBS -N Xchr
  #PBS -q smp
  #PBS -P CBBI0818
  #PBS -l select=1:ncpus=24
  #PBS -l walltime=48:00:00
  #PBS -M echimusa@gmail.com
  module load chpc/biomodules
  module load chpc/python/2.7.11
Scheduler commands:
  qsub: submit a job to the scheduler (e.g. qsub my_sequence.sh).
  qstat: view queued jobs (e.g. qstat -u user_name), or see what is on each queue (qstat -Q).
  qdel: delete one of your jobs from the queue (qdel ID_of_your_job).
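As an illustration of how such a job script is written, checked and submitted, here is a minimal sketch. The job name, project code YOUR_PROJECT, resource requests and file name my_job.pbs are placeholders, not values from the tutorial. Submission is attempted only where the qsub command exists.

```shell
# Write a small PBS job script; every value in it is a placeholder to adapt.
workdir=$(mktemp -d)
job="$workdir/my_job.pbs"
cat > "$job" <<'EOF'
#!/bin/bash
#PBS -N demo_job
#PBS -q smp
#PBS -P YOUR_PROJECT
#PBS -l select=1:ncpus=1
#PBS -l walltime=00:10:00
# Move to the directory from which qsub was run.
cd "$PBS_O_WORKDIR"
module load chpc/biomodules
echo "running on $(hostname)"
EOF

# Parse the script without executing it, to catch syntax errors early.
bash -n "$job" && echo "syntax OK: $job"

# Submit only where a PBS scheduler is available (e.g. on the cluster login node).
if command -v qsub >/dev/null 2>&1; then
    qsub "$job"               # prints the ID of the new job
    qstat -u "$USER"          # list your queued and running jobs
else
    echo "qsub not found; skipping submission"
fi
```

Checking the script with bash -n before submitting avoids waiting in the queue only to fail on a typo.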
Introduction to Cluster Server: Transferring files
From both Mac and Ubuntu, we use the terminal to transfer data from a local to a remote computer, or from a remote to a local machine. We commonly use rsync, scp and wget (or curl).
Syntax:
a) rsync options source destination
   -au: update files that are newer in the original directory
b) scp options source destination
   -r: required if copying a folder
c) From an FTP or internet source such as http://www.whatever.com/filename.txt:
   wget options source -O destination_path
   -nc => --no-clobber
   -N => turn on time-stamping
   -r => turn on recursive retrieving
   The destination path is optional if copying into the current folder.
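The rsync syntax above can be tried without a remote server by copying between two local folders. This is a minimal sketch with made-up file names; the remote forms are shown only as comments, where USER, HOST and the paths are placeholders.

```shell
# Local stand-ins for "source" and "destination" so the sketch runs anywhere.
src=$(mktemp -d)
dst=$(mktemp -d)
echo "v1" > "$src/data.txt"

if command -v rsync >/dev/null 2>&1; then
    # -a: archive mode (recursive, keeps times and permissions)
    # -u: skip files that are newer on the destination
    rsync -au "$src/" "$dst/"
else
    cp -Rp "$src/." "$dst/"   # fallback when rsync is not installed
fi

copied=$(cat "$dst/data.txt")
echo "destination now holds: $copied"

# Remote forms (not run here):
#   rsync -au local_dir/ USER@HOST:/remote/dir/
#   scp -r    local_dir  USER@HOST:/remote/dir/
#   scp       USER@HOST:/remote/file.txt .
#   wget -nc  http://www.whatever.com/filename.txt
```

Running the rsync line a second time transfers nothing, because the destination copy is already up to date.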
Introduction to Cluster Server: Transferring files
Transferring data from Windows: we can use the WinSCP software. To use WinSCP, launch the program and enter the appropriate information into the Host name, User name and Password text areas. Click Login to connect to the remote system. Once you are connected you should be able to transfer files and directories between systems using the simple graphical interface, by dragging files from one system to the other.
Introduction to Cluster Server: Transferring files
Transferring data from Windows with WinSCP:
[Figure: WinSCP window with two folder panels, labelled "Explore folder"; one panel shows the local machine.]
Once you are connected you should be able to transfer files and directories between systems by dragging files or folders between the two panels.
Connecting to CHPC and Downloading the Tutorial data
Connecting to CHPC and Downloading the Tutorial data
1. Connect to CHPC:
   (a) Windows users: open PuTTY and use the given CHPC login details.
   (b) Linux or Mac users: open the terminal and type
       ssh YOUR_USERNAME@lengau.chpc.ac.za
       (then type your password).
2. Once connected, change directory as follows (press Enter after each command):
   > cd /mnt/lustre/users/your_username/
3. Download the tutorial from http://www.cbio.uct.ac.za/emile-chimusa/gwas_2017/tutorial.tar.gz and unpack it:
   > wget http://www.cbio.uct.ac.za/emile-chimusa/gwas_2017/tutorial.tar.gz
   > tar xvf tutorial.tar.gz
   > cd Tutorial
   > ls
Tutorial data and Script to run jobs at CHPC
Inside the Tutorial folder:
A. SHELL: folder containing some Linux scripts to be used at the HPC.
   1. For PCA: run_pca.sh. This script uses the data prepared in step 1 and calls two Python scripts that run smartpca to conduct the PCA. As before, the #PBS lines at the top of the file specify the allocation on the server; they are followed by the working, data and software directory variables.
   2. Admixture (population structure): qsub_admixture.sh and runcontinent2.sh. Admixture is a clustering method that needs you to pre-specify the number of possible clusters in your data. We will run it just for K = 2, 3, 4 (see qsub_admixture.sh, which submits runcontinent2.sh to the server to run the admixture software).
B. Genesis_tutorial: this folder contains the Genesis software and the basic data that I demonstrated in the last class. Once you have your results from both PCA and admixture, you will use Genesis for plotting.
Tutorial data and Script to run jobs at CHPC
Inside the Tutorial folder:
C. population_structure_data: folder containing the Africa data (Africa55K_10Pops.fam, .bed, .bim). Remember that our target populations are the Hadza and Sandawe (Tanzania). We will investigate their population structure against the other populations in the whole dataset.
D. software: contains all your software, except smartpca.
E. GWAS_data: has the GWAS data (GWAS.ped, .pedind, .map for ~100 cases and 874 controls). This folder also has the run_gwas.sh script, which contains the commands to run the GWAS (pre-GWAS QC, association test and some adjustment). It also contains an R script to draw the Q-Q plot (qqplot.r) and Mahanatha.py (to draw the Manhattan plot). How to run them is shown in run_gwas.sh.
Introduction to plink
Getting plink running
1. Download and install PLINK from https://www.cog-genomics.org/plink2
2. Windows users: unzip the downloaded file. Copy the application file plink.exe into a folder called "Plink" (or whatever name you prefer) anywhere on your computer; creating the folder C:\plink is convenient.
3. Click Start > Run (or Start > Search Programs and Files), then type "cmd" and hit Enter to open a command prompt.
4. In the command prompt, go to the Plink directory (folder) where you pasted plink.exe. If it is C:\plink, type cd C:\plink
5. To go back to the parent directory, type cd .. until you reach the C: drive.
Introduction to plink
Popular Genomics data format: encoded data
Genotypes are coded based on the count of the minor allele:

            Genotypes                              Encoded data
Sample      SNP1 (T/A)  SNP2 (G/C)  SNP3 (G/A)     SNP1 (A/T)  SNP2 (C/G)  SNP3 (A/G)
emile       AA          CC          AA             2           2           2
Annie       AT          CG          AA             1           1           2
Gaston      AA          CG          AA             0           1           2
Jacqui      TT          GG          GG             0           0           0
Ephie       TT          CC          GG             0           2           0
Imani       TA          CG          GG             1           1           0

Annotation: a good ranking strategy would produce SNP3, SNP1, SNP2.
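The encoding in the table amounts to counting how many copies of one allele each genotype carries. The sketch below illustrates this for three SNP1 genotypes from the table. The helper name count_allele is hypothetical, and counting allele A is an assumption made for the illustration; which allele is actually the minor one depends on the whole sample.

```shell
# Count the copies of a chosen allele in a two-letter genotype (0, 1 or 2).
count_allele() {
    # $1 = genotype such as "AT", $2 = allele to count such as "A"
    # gsub returns the number of replacements it made, i.e. the allele count.
    printf '%s\n' "$1" | awk -v a="$2" '{ print gsub(a, "") }'
}

# Three SNP1 genotypes taken from the table, counting allele A.
while read -r name genotype; do
    echo "$name $genotype -> $(count_allele "$genotype" A)"
done <<'EOF'
emile AA
Annie AT
Jacqui TT
EOF
```

This prints 2, 1 and 0 for the three samples, matching the encoded SNP1 column for those rows.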
Introduction to plink
Popular Genomics data format
1. Standard format: map and ped files (the ped file is very wide when there are many more SNPs than individuals, because SNPs go in columns).
2. Binary format: bed, bim, fam files (compact files, about 1/10th the size of the original map/ped files).
3. VCF (.gz) file.
4. Oxford format: gen (.gz).
Introduction to plink
Popular Genomics data format

Format                                  Input option   Output option
PED/MAP                                 --file         --recode --out
BED/BIM/FAM                             --bfile        --make-bed --out
TPED/TFAM                               --tfile        --recode --transpose
RAW (coded on count of minor allele)    None           --recodeA
LGEN/MAP/FAM                            --lfile        --recode-lgen
VCF (.gz)                               --vcf          --recode vcf

Note that for the PED format, alleles can be encoded as ACGT or 1234. The --alleleACGT and --allele1234 options convert between these encodings, and --recode 12 writes alleles as 1/2. These options must be combined with --recode or --make-bed:
  plink --file filename --make-bed [options]
More detail at https://www.cog-genomics.org/plink/1.9/input
Introduction to plink
Popular Genomics data format: Examples
1. Convert data from bed, bim, fam files to a VCF file:
   plink --bfile example --recode vcf --out example
   Then convert the VCF back to bed, bim, fam:
   plink --vcf example.vcf --double-id --vcf-require-gt --biallelic-only strict --missing-genotype 0 --allow-extra-chr --make-bed --out example2
2. Convert bed, bim, fam to tped:
   plink --bfile example2 --recode --transpose --out tpexample
3. Convert ped/map to bed, bim, fam, and read a tped fileset with --tfile:
   plink --file example --make-bed --out example3
   plink --tfile tpexample --recode --transpose --out example
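The conversions above can be chained into a round trip. The sketch below assumes PLINK 1.9 is installed as plink and that a fileset named example exists; when either is missing it only prints the commands. The run helper, the DRY_RUN variable and the output names are hypothetical. `--recode transpose` is the PLINK 1.9 modifier form of the transposed output.

```shell
PREFIX=example   # hypothetical input prefix: example.bed, example.bim, example.fam

# Execute for real only when plink and the input files are both present.
if command -v plink >/dev/null 2>&1 && [ -e "$PREFIX.bed" ]; then
    DRY_RUN=0
else
    DRY_RUN=1
fi

run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "[dry run] $*"   # show the command without executing it
    else
        "$@"
    fi
}

# bed/bim/fam -> VCF (writes example_vcf.vcf)
run plink --bfile "$PREFIX" --recode vcf --out "${PREFIX}_vcf"

# VCF -> bed/bim/fam; --double-id uses each VCF sample ID as both FID and IID
run plink --vcf "${PREFIX}_vcf.vcf" --double-id --make-bed --out "${PREFIX}_back"

# bed/bim/fam -> tped/tfam
run plink --bfile "${PREFIX}_back" --recode transpose --out "${PREFIX}_t"
```

Giving each step its own --out name keeps earlier filesets from being overwritten.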
Introduction to plink
Popular Genomics data format: Slicing, dicing, ...
Add the plink parameters below to the previous commands.
Filtering:
  --chr    keep only the data of a specific chromosome
  --maf    keep only SNPs whose minor allele frequency is at or above the specified value
  --mind   remove samples whose fraction of missing genotypes exceeds the specified value
  --geno   remove SNPs whose fraction of missing genotypes exceeds the specified value
  --hwe    remove SNPs whose p-value for deviation from HWE is below the specified value
Get a subset of SNPs:
  --snps      extract a SNP or a range of SNPs
  --extract   keep only the SNPs listed in a file
  --exclude   remove the SNPs listed in a file
Get a subset of samples:
  --keep sample.txt
  --remove sample.txt
Example:
  plink --bfile example --chr 22 --recode --out example.22
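The sample lists read by --keep and --remove are plain text files with one sample per line: family ID followed by individual ID. As a sketch of how such a list can be built, the example below selects the cases from a made-up .fam file. The file toy.fam, its contents and the output prefix example_cases are illustrative only.

```shell
workdir=$(mktemp -d)

# A made-up .fam file with columns:
# FID IID father mother sex(1=male, 2=female) phenotype(1=control, 2=case)
cat > "$workdir/toy.fam" <<'EOF'
F1 emile 0 0 1 2
F2 Annie 0 0 2 1
F3 Gaston 0 0 1 1
F4 Jacqui 0 0 2 2
EOF

# Keep the first two columns (FID and IID) of every sample coded as a case.
awk '$6 == 2 { print $1, $2 }' "$workdir/toy.fam" > "$workdir/sample.txt"
cat "$workdir/sample.txt"

# With plink installed, the list would then be used as (not run here):
#   plink --bfile example --keep sample.txt --make-bed --out example_cases
```

Changing the awk condition (for example `$5 == 2` for females) produces other subsets in the same format.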
Introduction to plink
Popular Genomics data format: Slicing, dicing, ...
1. Subsetting the data consisting of chromosome 22:
   > plink --bfile example --recode --chr 22 --out hap.chr22
2. Subsetting the data consisting of only males:
   > plink --bfile example --recode --filter-males --out hap_males
3. Subsetting the data consisting of only females:
   > plink --bfile example --recode --filter-females --out hap_females
4. Subsetting the data consisting of only cases:
   > plink --bfile example --recode --filter-cases --out hap_cases
5. Subsetting the data consisting of only controls:
   > plink --bfile example --recode --filter-controls --out hap_controls
Introduction to plink
Popular Genomics data format: Exercise
1. Use the example data (example.bed, example.bim, example.fam) to convert to a VCF file, retaining only the data of chromosome 10 in the output.
2. Use the example data (example.bed, example.bim, example.fam) to convert to tped files that
   (a) retain only the samples in subsample_extract.txt;
   (b) exclude the samples in subsample.txt.
   Write (a) and (b) into different files, with genotypes coded as 1234 for (a) and as 12 for (b).
3. Use the example data (example.bed, example.bim, example.fam) to
   (a) extract the data from rs5770916 to rs9616985 and write the output in ped/map format;
   (b) extract the SNPs in file SNP_extract.txt and write the output in bed/bim/fam format;
   (c) exclude the SNPs in file SNP_exclude.tx and write the output as VCF;
   (d) write into VCF only the common SNPs (MAF >= 0.05) of chromosome 1.
Genomics Data Quality Control
Removing bad SNPs and individuals: first, remove any individuals who have less than, say, 95% genotype data (--mind 0.05); then remove SNPs with a minor allele frequency below, say, 1% (--maf 0.01); then remove SNPs with a genotype call rate below, say, 90%, i.e. more than 10% missing genotypes (--geno 0.1).
Example: removing individuals with more than 5% missing genotypes, SNPs with MAF < 1%, SNPs with a genotype call rate < 90%, and SNPs with a p-value < 0.05 for deviation from HWE:
  > plink --bfile example --make-bed --mind 0.05 --maf 0.01 --geno 0.1 --hwe 0.05 --out Dclean
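To make the SNP thresholds concrete, the sketch below computes the minor allele frequency and missing rate per SNP on a small made-up matrix (rows are samples, columns are SNPs coded as allele counts 0/1/2, NA for a missing genotype). It flags each SNP using the cut-offs 0.01 and 0.10 named above. It illustrates what the filters measure and is not how plink implements them; the data and variable names are hypothetical.

```shell
# Toy genotype matrix: sample ID followed by three SNPs (made-up values).
toy=$(mktemp)
cat > "$toy" <<'EOF'
s1 2 0 NA
s2 1 0 2
s3 0 0 NA
s4 0 0 1
s5 1 0 0
EOF

report=$(awk -v maf_min=0.01 -v geno_max=0.10 '
{
    n++                                   # number of samples
    nsnp = NF - 1
    for (j = 2; j <= NF; j++) {
        if ($j == "NA") { miss[j]++ }
        else            { sum[j] += $j; called[j]++ }
    }
}
END {
    for (j = 2; j <= nsnp + 1; j++) {
        # allele frequency = allele copies / (2 * samples with a called genotype)
        p = (called[j] > 0) ? sum[j] / (2 * called[j]) : 0
        maf = (p < 0.5) ? p : 1 - p
        missrate = miss[j] / n
        status = (missrate > geno_max || maf < maf_min) ? "remove" : "keep"
        printf "SNP%d maf=%.2f missing=%.2f %s\n", j - 1, maf, missrate, status
    }
}' "$toy")
echo "$report"
```

Here SNP2 is removed because no sample carries the counted allele (MAF 0), and SNP3 is removed because 2 of 5 genotypes are missing.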
Work is done, relax on beach?