PLINK tutorial

Amended from two tutorials that the PLINK author Shaun Purcell wrote, see and 'Teaching materials and example dataset' at

Download PLINK from

In this tutorial, we will use PLINK to analyse some real and some example large-scale SNP data, to give a demonstration of what the program can do (e.g. data management, summary statistics and basic association analysis). What we do today might not be particularly realistic or accurate, but I hope it gives an idea of what PLINK is capable of!

EXAMPLE DATASETS AND SOFTWARE

1. Approximately 80,000 autosomal SNPs from the 89 Asian HapMap individuals (Han Chinese from Beijing and Japanese from Tokyo). A phenotype has been simulated based on the genotype at one SNP. Download from
2. A bigger file with all ~250,000 SNPs for 90 Asian HapMap individuals (Han Chinese from Beijing and Japanese from Tokyo), along with the simulated disease phenotype. In addition, a small subset of SNPs (N=29) genotyped on the same individuals, representing a "follow-up genotyping" exercise, is included, as well as a file with population membership (Chinese or Japanese). Download from
3. A file of 771 SNPs genotyped on a captive zebra finch pedigree.

CARDINAL RULES & CAVEATS

When using PLINK there are a few key points to remember.

- Always consult the LOG file (console output)
- PLINK has no memory
  o each run loads data anew, previous filters lost
- Exact syntax and spelling is very important
  o minus minus
- Not every option can be combined with every other option
  o For example, basic haplotype tests cannot take covariates
  o PLINK doesn't always warn you
  o LOG file often shows what has happened (or not)
- Consult the web documentation

GETTING STARTED

PLINK is a command line program, so we need to operate in a command line window. All commands involve typing plink at the command prompt (e.g. DOS window or Unix terminal) followed by a number of options (all starting with --option) to specify the data files / methods to be used. All results are written to files with various extensions.

Putting your files in the same directory as the "plink.exe" file will let you do all analysis from this directory. Navigate to the plink directory and, to check you are in the correct folder, type plink and press enter, which should start PLINK and generate some output describing the program. If you get an error message, you are in the wrong directory.

For most options, PLINK needs two plain text files:

1) A file with family ID, individual ID, father ID, mother ID, sex (1 = male, 2 = female, other = unknown), phenotype and genotypes in columns for each individual, with the extension .ped. This file has NO HEADER so looks like this:

    FAM A A G G A C
    FAM A A A G 0 0

2) A file with marker information, including chromosome number, the SNP name, the position in morgans on the chromosome, and the position in base pairs on the chromosome, with the extension .map. Again, this file has NO HEADER so looks like this:

    1 rs rs rs rs

HINT! It is easiest if these files have the same name (e.g. sheep.ped and sheep.map). There are lots of options to change the format of these files (for example, you can provide genotypes as "AG" instead of "A G") - see the information on the PLINK website.
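The column layout of the two input files described above can be illustrated with a small parser. This is only a sketch of the text formats as described in this tutorial, not PLINK's own reader; the record names and the sample lines are made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: str          # 1 = male, 2 = female, other = unknown
    phenotype: str
    genotypes: List[Tuple[str, str]]  # one (allele, allele) pair per SNP


class MapRecord(NamedTuple):
    chromosome: str
    snp_id: str
    morgans: float
    base_pairs: int


def parse_ped_line(line: str) -> PedRecord:
    """Split one header-less .ped row into six ID columns plus allele pairs."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 columns followed by pairs of alleles")
    alleles = fields[6:]
    pairs = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return PedRecord(*fields[:6], pairs)


def parse_map_line(line: str) -> MapRecord:
    """Split one header-less .map row into its four columns."""
    chromosome, snp_id, morgans, base_pairs = line.split()
    return MapRecord(chromosome, snp_id, float(morgans), int(base_pairs))


if __name__ == "__main__":
    # Hypothetical rows, invented for this example only.
    print(parse_ped_line("FAM001 1 0 0 1 2 A A G G A C"))
    print(parse_map_line("1 snp_a 0 5000"))
```

Each .ped row should carry exactly two allele columns per row of the .map file, which is a quick consistency check to run before handing files to PLINK.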

HINT! PLINK is designed for humans - so unless you tell it otherwise, it will assume your genome has 23 chromosomes, with chr23 being the X chromosome! Use the option --dog for up to 39 chromosomes.

1. HAPMAP1 data

Download the example data from and unzip the contents into your plink folder. A phenotype was simulated, so that a single SNP (rs ) should be associated with the 'disease'. The files are

    hapmap1.ped   Genotype data for 83,000 SNPs on 90 individuals
    hapmap1.map   Map file for these SNPs
    pop.phe       Population membership coding (coded 1=CH / 2=JP)
    qt.phe        Quantitative phenotype - we won't use this one

Just typing plink and specifying a file with no further options is a good way to check that the file is intact, and to get some basic summary statistics about the file.

    plink --file hapmap1

The --file option takes a single parameter, the input file name, and will look for two files: a .ped file and a .map file with the name hapmap1 (i.e. hapmap1.ped and hapmap1.map). The above command should generate something like the following output in the console window. It will also save this information to a file called plink.log.

    PLINK! v0.99l 27/Jul/
    (C) 2006 Shaun Purcell, GNU General Public License, v
    Web-based version check ( --noweb to skip )
    Connecting to web OK, v0.99l is current
    *** Pre-Release Testing Version ***
    Writing this text to log file [ plink.log ]
    Analysis started: Mon Jul 31 09:00:
    Options in effect:
        --file hapmap1
    83534 (of 83534) markers to be included from [ hapmap1.map ]
    89 individuals read from [ hapmap1.ped ]
    89 individuals with nonmissing phenotypes
    Assuming a binary trait (1=unaff, 2=aff, 0=miss)
    Missing phenotype value is also -9
    Before frequency and genotyping pruning, there are SNPs
    Applying filters (SNP-major mode)
    89 founders and 0 non-founders found
    0 of 89 individuals removed for low genotyping ( MIND > 0.1 )
    859 SNPs failed missingness test ( GENO > 0.1 )
    16994 SNPs failed frequency test ( MAF < 0.01 )
    After frequency and genotyping pruning, there are SNPs
    Analysis finished: Mon Jul 31 09:00:

The information contained here can be summarized as follows:

- A banner showing copyright information and the version number -- the web-based version check shows that this is an up-to-date version of PLINK and displays a message that v0.99l is a pre-release testing version.
- A message indicating that the log file will be saved in plink.log. The name of the output file can be changed with the --out option -- e.g. specifying --out anal1 will generate a log file called anal1.log instead.
- A list of the command options specified is given next: in this case it is only a single option, --file hapmap1. Keeping the log files, and naming each analysis with its own --out name, makes it easier to keep track of when and how the different output files were generated.
- Next is some information on the number of markers and individuals read from the MAP and PED file. In total, just over 80,000 SNPs were read in from the MAP file. It is written "83534 (of 83534)" because some SNPs might be excluded (by making the physical position a negative number in the MAP file), in which case the first number would indicate how many SNPs are included. In this case, all SNPs are read in from the PED file. We also see that 89 individuals were read in from the PED file, and that all these individuals had valid phenotype information.
- Next, PLINK tells us that the phenotype is an affection status variable, as opposed to a quantitative trait, and lets us know what the missing values are.
- The next stage is the filtering stage -- individuals and/or SNPs are removed on the basis of thresholds. See the PLINK web documentation for more information on setting thresholds. In this case we see that no individuals were removed, but almost 20,000 SNPs were removed, based on missingness (859) and frequency (16994). This particularly high proportion of removed SNPs arises because these are random HapMap SNPs in the Chinese and Japanese samples, rather than preselected markers on a whole-genome association product: there will be many more rare and monomorphic markers here than one would normally expect.
- Finally, a line is given that indicates when this analysis finished. You can see that it took 8 seconds (on my machine at least) to read in the file and apply the filters.

If other analyses had been requested, then the other output files generated would have been indicated in the log file.

HINT! All output files that PLINK generates have the same format: root.extension, where root is, by default, "plink" but can be changed with the --out option, and the extension depends on the type of output file.

Making a binary PED file

The first thing we will do is to make a binary PED file. This more compact representation of the data saves space and speeds up subsequent analysis. To make a binary PED file, use the following command.

    plink --file hapmap1 --make-bed --out hapmap1

If it runs correctly on your machine, you should see the following in your output:

    (above as before)
    Before frequency and genotyping pruning, there are SNPs
    Applying filters (SNP-major mode)
    89 founders and 0 non-founders found
    0 SNPs failed missingness test ( GENO > 1 )
    0 SNPs failed frequency test ( MAF < 0 )
    After frequency and genotyping pruning, there are SNPs
    Writing pedigree information to [ hapmap1.fam ]
    Writing map (extended format) information to [ hapmap1.bim ]
    Writing genotype bitfile to [ hapmap1.bed ]
    Using (default) SNP-major mode
    Analysis finished: Mon Jul 31 09:10:

There are several things to note:

- When using the --make-bed option, the threshold filters for missing rates and allele frequency were automatically set to exclude nobody. Although these filters can be specified manually (using --mind, --geno and --maf) to exclude people or SNPs, this default tends to be wanted when creating a new PED or binary PED file. The commands --extract / --exclude and --keep / --remove can also be applied at this stage.
- Three files are created with this command: the binary file hapmap1.bed that contains the raw genotype data, a revised map file hapmap1.bim which contains two extra columns that give the allele names for each SNP, and hapmap1.fam which is just the first six columns of hapmap1.ped. You can view the .bim and .fam files -- but do not try to view the .bed file. None of these three files should be manually edited.

If, for example, you wanted to create a new file that only includes individuals with high genotyping (at least 95% complete), you would run:

    plink --file hapmap1 --make-bed --mind 0.05 --out highgeno

which would create the files

    highgeno.bed
    highgeno.bim
    highgeno.fam

Working with the binary PED file

To specify that the input data are in binary format, as opposed to the normal text PED/MAP format, just use the --bfile option instead of --file.
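To make the relationship between the text files and the two viewable binary-fileset files concrete, here is a rough sketch that derives .fam-style and .bim-style rows from PED/MAP text. It is an illustration only, not PLINK's implementation: the ordering of the two allele columns (less common allele first, "0" when an allele is unobserved) is an assumption made for this example, the function names are hypothetical, and the binary .bed file itself is not produced.

```python
from collections import Counter

MISSING = "0"  # missing-allele code used in the example PED rows


def fam_rows(ped_lines):
    """A .fam row is just the first six columns of the matching .ped row."""
    return [" ".join(line.split()[:6]) for line in ped_lines if line.strip()]


def bim_rows(ped_lines, map_lines):
    """Append two allele-name columns to each .map row.

    Assumption for this sketch: the less common allele is listed first and
    "0" stands in for an allele that was never observed.
    """
    ped = [line.split() for line in ped_lines if line.strip()]
    markers = [line.split() for line in map_lines if line.strip()]
    rows = []
    for j, marker in enumerate(markers):
        counts = Counter()
        for fields in ped:
            # Columns 6+2j and 7+2j hold the two alleles of SNP j.
            for allele in fields[6 + 2 * j: 8 + 2 * j]:
                if allele != MISSING:
                    counts[allele] += 1
        if len(counts) > 2:
            raise ValueError("more than two alleles at " + marker[1])
        names = sorted(counts, key=lambda allele: (counts[allele], allele))
        names = ([MISSING, MISSING] + names)[-2:]
        rows.append(" ".join(marker + names))
    return rows


if __name__ == "__main__":
    # Hypothetical three-person, two-SNP dataset.
    ped = ["F1 1 0 0 1 2 A A G G",
           "F2 1 0 0 2 1 A C G G",
           "F3 1 0 0 1 1 A A 0 0"]
    snp_map = ["1 snp1 0 1000", "1 snp2 0 2000"]
    print("\n".join(fam_rows(ped)))
    print("\n".join(bim_rows(ped, snp_map)))
```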
To repeat the first command we ran (which just loads the data and prints some basic summary statistics):

    plink --bfile hapmap1

    Writing this text to log file [ plink.log ]
    Analysis started: Mon Jul 31 09:12:
    Options in effect:
        --bfile hapmap1
    Reading map (extended format) from [ hapmap1.bim ]
    83534 (of 83534) markers to be included from [ hapmap1.bim ]
    Reading pedigree information from [ hapmap1.fam ]
    89 individuals read from [ hapmap1.fam ]
    89 individuals with nonmissing phenotypes
    Reading genotype bitfile from [ hapmap1.bed ]
    Detected that binary PED file is v1.00 SNP-major mode
    Before frequency and genotyping pruning, there are SNPs
    Applying filters (SNP-major mode)
    89 founders and 0 non-founders found
    0 of 89 individuals removed for low genotyping ( MIND > 0.1 )
    859 SNPs failed missingness test ( GENO > 0.1 )
    16994 SNPs failed frequency test ( MAF < 0.01 )
    After frequency and genotyping pruning, there are SNPs
    Analysis finished: Mon Jul 31 09:12:

The things to note here:

- Three files, hapmap1.bim, hapmap1.fam and hapmap1.bed, were loaded instead of the usual two files. That is, hapmap1.ped and hapmap1.map are not used in this analysis, and could in fact be deleted now.
- The data are loaded much more quickly -- based on the timestamps at the beginning and end of the log output, this took 2 seconds instead of 10.

Summary statistics: missing rates

Next, we shall generate some simple summary statistics on rates of missing data in the file, using the --missing option:

    plink --bfile hapmap1 --missing --out miss_stat

which should generate the following output:

    0 of 89 individuals removed for low genotyping ( MIND > 0.1 )
    Writing individual missingness information to [ miss_stat.imiss ]
    Writing locus missingness information to [ miss_stat.lmiss ]

Here we see that no individuals were removed for low genotyping (MIND > 0.1 implies that we accept people with less than 10 percent missingness). The per individual and per SNP (after excluding individuals on the basis of low genotyping) rates are then output to the files miss_stat.imiss and miss_stat.lmiss respectively. If we had not specified an --out option, the root output filename would have defaulted to "plink".

These output files are standard, plain text files that can be viewed in any text editor, pager, spreadsheet or statistics package (albeit one that can handle large files). Taking a look at the file miss_stat.lmiss, for example using the more command which is present on most systems:

    more miss_stat.lmiss

we see

    CHR SNP N_MISS F_MISS
    1 rs rs rs rs rs rs rs rs rs

HINT! To exit from more, type 'q' to quit.

That is, for each SNP, we see the number of missing individuals (N_MISS) and the proportion of individuals missing (F_MISS). Similarly:

    more miss_stat.imiss

we see

    FID IID MISS_PHENO N_MISS F_MISS
    HCB181 1 N
    HCB182 1 N
    HCB183 1 N
    HCB184 1 N
    HCB185 1 N
    HCB186 1 N
    HCB187 1 N

The final column is the proportion of missing genotypes for that individual -- we see the genotyping rate is very high here.

HINT! If you are using a spreadsheet package that can only display a limited number of rows (some popular packages can handle just over 65,000 rows) then it might be desirable to ask PLINK to analyse the data by chromosome, using the --chr option. For example, to perform the above analysis for chromosome 1:

    plink --bfile hapmap1 --chr 1 --out res1 --missing

then for chromosome 2:

    plink --bfile hapmap1 --chr 2 --out res2 --missing

and so on.
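As a cross-check on what the N_MISS and F_MISS columns mean, the same quantities can be computed directly from a small PED-style table. The sketch below is an illustration under simple assumptions, not PLINK's code: a genotype counts as missing when either allele is coded "0", no individuals are excluded beforehand, and the four sample rows are invented.

```python
def missing_rates(ped_lines):
    """Return (per-individual, per-SNP) missing-genotype counts and rates.

    Per-individual entries are (FID, IID, N_MISS, F_MISS); per-SNP entries are
    (snp_index, N_MISS, F_MISS), with SNPs numbered in file order from 0.
    """
    rows = [line.split() for line in ped_lines if line.strip()]
    n_ind = len(rows)
    n_snp = (len(rows[0]) - 6) // 2
    snp_missing = [0] * n_snp
    imiss = []
    for fields in rows:
        n_miss = 0
        for j in range(n_snp):
            a1, a2 = fields[6 + 2 * j], fields[7 + 2 * j]
            if a1 == "0" or a2 == "0":  # assumed missing-genotype code
                n_miss += 1
                snp_missing[j] += 1
        imiss.append((fields[0], fields[1], n_miss, n_miss / n_snp))
    lmiss = [(j, count, count / n_ind) for j, count in enumerate(snp_missing)]
    return imiss, lmiss


if __name__ == "__main__":
    # Hypothetical data: four individuals typed at three SNPs.
    ped = ["F1 1 0 0 1 2 A A 0 0 G G",
           "F2 1 0 0 2 1 A C G G 0 0",
           "F3 1 0 0 1 1 0 0 G T G G",
           "F4 1 0 0 2 2 C C 0 0 G G"]
    per_individual, per_snp = missing_rates(ped)
    for row in per_individual:
        print(*row)
    for row in per_snp:
        print(*row)
```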

Summary statistics: allele frequencies

Next we perform a similar analysis, except requesting allele frequencies instead of genotyping rates. The following command generates a file called freq_stat.frq which contains the minor allele frequency and allele codes for each SNP.

    plink --bfile hapmap1 --freq --out freq_stat

It is also possible to perform this frequency analysis (and the missingness analysis) stratified by a categorical cluster variable. In this case, we shall use the file that indicates whether the individual is from the Chinese or the Japanese sample, pop.phe. This cluster file contains three columns; each row is an individual. The format is described more fully in the main documentation. To perform a stratified analysis, use the --within option.

    plink --bfile hapmap1 --freq --within pop.phe --out freq_stat

The output will now indicate that a file called freq_stat.frq.strat has been generated instead of freq_stat.frq. If we view this file:

    more freq_stat.frq.strat

we see each row is now the allele frequency for each SNP stratified by subpopulation:

    CHR SNP CLST A1 A2 MAF
    1 rs rs rs rs rs rs rs rs

Here we see that each SNP is represented twice - the CLST column indicates whether the frequency is from the Chinese or Japanese populations, coded as per the pop.phe file.

If you were just interested in a specific SNP, and wanted to know what the frequency was in the two populations, you can use the --snp option to select this SNP:

    plink --bfile hapmap1 --snp rs --freq --within pop.phe --out snp1_frq_stat

would generate a file snp1_frq_stat.frq.strat containing only the population-specific frequencies for this single SNP. You can also specify a range of SNPs by adding the --window kb option or using the options --from and --to, following each with a different SNP (they must be in the correct order and be on the same chromosome).

Basic association analysis

Let's now perform a basic association analysis on the disease trait for all single SNPs. The basic command is

    plink --bfile hapmap1 --assoc --out as1

which generates an output file as1.assoc which contains the following fields

    CHR SNP A1 F_A F_U A2 CHISQ P OR
    1 rs rs rs rs rs rs rs rs rs

where each row is a single SNP association result. The fields are:

- Chromosome
- SNP identifier
- Code for allele 1 (the minor, rare allele based on the entire sample frequencies)
- The frequency of this variant in cases
- The frequency of this variant in controls
- Code for the other allele
- The chi-squared statistic for this test (1 df)
- The asymptotic significance value for this test
- The odds ratio for this test

If a test is not defined (for example, if the variant is monomorphic but was not excluded by the filters) then values of NA (not applicable) will be given. The package R reads NA as missing data, which is convenient if using R to analyse the set of results.

HINT! In a Unix/Linux environment, you can use the available command line tools to sort the list of association statistics and print out the top ten, for example:

    sort --key=7 -nr as1.assoc | head

would give

    13 rs e rs e

Here we see that the simulated disease variant rs is actually the second most significant SNP in the list, with a large difference in allele frequencies of 0.28 in cases versus 0.62 in controls. However, we also see that, just by chance, a second SNP on chromosome 13 shows a slightly higher test result, with coincidentally similar allele frequencies in cases and controls. When performing so many tests, particularly in a small sample, we often expect the distribution of true positive results to be virtually indistinguishable from the best false positive results. That our variant appears in the top ten list is reassuring however.

To get a sorted list of association results that also includes a range of significance values adjusted for multiple testing, use the --adjust flag:

    plink --bfile hapmap1 --assoc --adjust --out as2

This generates the file as2.assoc.adjusted in addition to the basic as2.assoc output file. Using more (or opening the file), one can easily look at one's most significant associations:

    more as2.assoc.adjusted

    CHR SNP UNADJ GC BONF HOLM SIDAK_SS SIDAK_SD FDR_BH FDR_BY
    13 rs e e rs e e rs e e rs e e rs e e rs e e rs e e rs e rs e rs e rs e

Here we see a pre-sorted list of association results. The fields are as follows:

- Chromosome
- SNP identifier
- Unadjusted, asymptotic significance value
- Genomic control adjusted significance value. This is based on a simple estimation of the inflation factor based on the median chi-squared statistic. These values therefore do not control for multiple testing.
- Bonferroni adjusted significance value
- Holm step-down adjusted significance value
- Sidak single-step adjusted significance value
- Sidak step-down adjusted significance value
- Benjamini & Hochberg (1995) step-up FDR control
- Benjamini & Yekutieli (2001) step-up FDR control

In this particular case, we see that no single variant is significant at the 0.05 level after genome-wide correction. Different correction measures have different properties which are beyond the scope of this tutorial to discuss: it is up to the investigator to decide which to use and how to interpret them.

When the --adjust command is used, the log file records the inflation factor calculated for the genomic control analysis, and the mean chi-squared statistic (that should be 1 under the null):

    Genomic inflation factor (based on median chi-squared) is
    Mean chi-squared statistic is

These values would actually suggest that although no very strong stratification exists, there is perhaps a hint of an increased false positive rate, as both values are greater than 1.

HINT! The adjusted significance values that control for multiple testing are, by default, based on the unadjusted significance values. If the flag --gc is specified as well as --adjust then these adjusted values will be based on the genomic-control significance value instead.

In this particular instance, where we already know about the Chinese/Japanese subpopulations, it might be of interest to directly look at the inflation factor that results from having population membership as the phenotype in a case/control analysis, just to provide extra information about the sample. That is, running the command using the alternate phenotype option (i.e. replacing the disease phenotype with the one in pop.phe, which is actually subpopulation membership):

    plink --bfile hapmap1 --pheno pop.phe --assoc --adjust --out as3

When testing for frequency differences between Chinese and Japanese individuals, we do see some departure from the null distribution:

    Genomic inflation factor (based on median chi-squared) is
    Mean chi-squared statistic is

That is, the inflation factor of 1.7 represents the maximum possible inflation factor that could arise from the Chinese/Japanese split in the sample if the disease were perfectly correlated with subpopulation (this does not account for any possible within-subpopulation structure, of course, that might also increase SNP-disease false positive rates). This is a good test of whether it is appropriate to do an association study without adjusting for population stratification.

Extracting a SNP of interest

Finally, given you've identified a SNP, set of SNPs or region of interest, you might want to extract those SNPs as a separate, smaller, more manageable file. In particular, for other applications to analyse the data, you will need to convert from the binary PED file format to a standard PED format. This is done using the --recode options. There are a few forms of this option: we will use --recode12, which codes the genotypes in a manner that is convenient for subsequent analysis. To extract only this single SNP, use:

    plink --bfile hapmap1 --snp rs --recode12 --out rec_snp1

This particular recode feature codes genotypes as 1/2 alleles, and outputs new .ped and .map files with this SNP.

2. LARGER SET OF HAPMAP1 DATA

The files are

    wgas1.ped   Genotype data for 250,000 SNPs on 90 individuals
    wgas1.map   Map file for these SNPs
    extra.ped   Genotype data for an additional 29 SNPs genotyped for the same individuals
    extra.map   Map file for these SNPs
    pop.cov     Population membership coding (coded 1=CH / 2=JP)

First, make a new binary file of the data. Note this operation may take a while.

    plink --file wgas1 --make-bed --out wgas3

Previous analyses have shown that a SNP rs was the most highly associated with the phenotype. We now want to extract the data for rs and perform a series of more detailed analyses on this single SNP.
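Before looking at a single SNP in detail, it may help to recall what a basic allelic association result contains. The sketch below computes the quantities reported in the --assoc columns (F_A, F_U, CHISQ, P, OR) from a 2x2 table of allele counts, using the standard 1-df Pearson chi-squared test. It is a textbook calculation written for illustration, not PLINK's source code; the function name and the example counts are hypothetical.

```python
from math import erfc, sqrt


def allelic_test(case_a1, case_a2, ctrl_a1, ctrl_a2):
    """Allelic chi-squared test (1 df) from counts of allele 1 and allele 2.

    Returns a dict keyed like the --assoc columns; entries are None when the
    quantity is undefined (for example, for a monomorphic SNP).
    """
    a, b, c, d = case_a1, case_a2, ctrl_a1, ctrl_a2
    n = a + b + c + d
    result = {
        "F_A": a / (a + b) if a + b else None,   # allele 1 frequency, cases
        "F_U": c / (c + d) if c + d else None,   # allele 1 frequency, controls
        "CHISQ": None,
        "P": None,
        "OR": (a * d) / (b * c) if b * c else None,
    }
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom:
        chisq = n * (a * d - b * c) ** 2 / denom
        result["CHISQ"] = chisq
        # Upper tail of a chi-squared distribution with 1 df.
        result["P"] = erfc(sqrt(chisq / 2.0))
    return result


if __name__ == "__main__":
    # Hypothetical counts: 20 cases and 20 controls, two alleles each.
    print(allelic_test(case_a1=10, case_a2=30, ctrl_a1=20, ctrl_a2=20))
```

With 80,000 or more such tests, a P-value that looks small for one SNP can easily arise by chance, which is why the adjusted values from --adjust matter.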

HINT! Remember everything in the command should be typed on a single line (not across lines as shown in the boxes below).

    Purpose:  Extract data for single SNP rs
    Command:  plink --bfile wgas3 --recode --snp rs --out tophit
    Input:    wgas3.bed, wgas3.bim, wgas3.fam   QC+ whole genome SNP binary fileset
    Output:   tophit.ped   Standard PED file for this single SNP
              tophit.map   Corresponding marker information
    Notes:    We are converting back from the binary format to standard text format. The --snp command is a filter, just extracting data for this one SNP.

For this single SNP, we shall next examine the genotyping rate and, second, the Hardy-Weinberg test statistic. In all cases, here and below, the analysis output files are small. These can be viewed by typing, for example, either the "more" or "type" DOS commands:

    more plink.lmiss

or

    type plink.assoc

    Purpose:  Examine genotyping rate for rs
    Command:  plink --file tophit --all --missing
    Input:    tophit.ped   Standard PED file for single SNP
              tophit.map
    Output:   plink.lmiss  Missing rate per locus (SNP)
              plink.imiss  Missing rate per individual
    Notes:    The --all flag is added because otherwise PLINK would first remove any individual with missing genotypes for this SNP, before calculating the per-SNP genotyping rate. Also note the use of --file instead of --bfile, as tophit is in standard PED format. Finally, note that we do not always need to specify a unique output name when using PLINK directly, so all output files start plink.ext by default.

    Purpose:  Examine Hardy-Weinberg equilibrium P-value for rs
    Command:  plink --file tophit --hardy
    Input:    tophit.ped   Standard PED file for single SNP
              tophit.map   Corresponding marker information

    Output:   plink.hwe    Hardy-Weinberg statistic and genotype counts
    Notes:    For case/control datasets, tests are given for all individuals, as well as for cases and controls separately.

Next, we can ask whether allele frequency differs between the two groups. This involves using the population label as the phenotype of an association test rather than as a covariate.

    Purpose:  Explicitly test whether allele frequency for rs differs between populations
    Command:  plink --file tophit --assoc --pheno pop.cov
    Input:    tophit.ped   Standard PED file for single SNP
              tophit.map   Corresponding marker information
              pop.cov      Indicates Chinese (1) or Japanese (2)
    Output:   plink.assoc  Association (with population) results
    Notes:    Here we specify population as the phenotype, not a covariate.

    Purpose:  Explicitly test whether allele frequency for rs differs between populations, allowing for association with disease
    Command:  plink --file tophit --logistic --pheno pop.cov --covar tophit.ped --covar-number 4
    Input:    tophit.ped   Standard PED file for single SNP
              tophit.map   Corresponding marker information
              pop.cov      Indicates Chinese (1) or Japanese (2)
    Output:   plink.assoc.logistic   Association (with population) results
    Notes:    We treat the PED file as a covariate file, extracting just the phenotype (i.e. the 4th column after family ID and individual ID).

These results would suggest that the frequency does indeed differ (again, make a note of exactly why this is).

Population stratification

Initially, we used the known population labels of Chinese versus Japanese. In many studies, we might not have this direct information, or the potential differences in ancestry can be subtle.

Analyses of population stratification should be performed on a set of SNPs that are approximately in linkage equilibrium: we achieve this by using PLINK's command to remove highly correlated, nearby SNPs. Note: this operation may take a while.

    Purpose:  Create a LD pruned set of markers (first step)
    Command:  plink --bfile wgas3 --indep-pairwise 50 10 0.2 --out prune1
    Input:    wgas3.bed, wgas3.bim, wgas3.fam   QC+ whole genome SNP binary fileset
    Output:   prune1.prune.in    List of SNPs included after pruning
              prune1.prune.out   List of SNPs excluded after pruning
    Notes:    This option does not actually remove any SNPs, it just creates two lists of SNPs, which we use below. The pruning marks for removal any SNP that has r-squared > 0.2 with another SNP within a 50-SNP window; this window is shifted across the chromosome 10 SNPs at a time.

We next calculate identity-by-state (IBS) allelic similarity between all possible pairs of the 89 QC+ individuals, and store this information in a file.

    Purpose:  Calculate genome-wide IBS sharing based on pruned marker list
    Command:  plink --bfile wgas3 --extract prune1.prune.in --genome --out ibs1
    Input:    wgas3.bed, wgas3.bim, wgas3.fam   QC+ whole genome SNP binary fileset
    Output:   ibs1.genome   IBS sharing data (1 row per pair of individuals)
    Notes:    Equivalently, one could --exclude prune1.prune.out

Finally, using the pairwise IBS information in ibs1.genome, we perform stratification analysis. Note this may take a while.

    Purpose:  Cluster individuals into homogeneous groups and perform a multidimensional scaling analysis
    Command:  plink --bfile wgas3 --read-genome ibs1.genome --cluster --ppc 1e-3 --cc --mds-plot 2 --out strat1
    Input:    wgas3.bed, wgas3.bim, wgas3.fam   QC+ whole genome SNP binary fileset
              ibs1.genome                       Pre-calculated pairwise IBS values
    Output:   strat1.cluster2   Assignment to cluster for each individual
              strat1.mds        First 2 MDS components for each individual
    Notes:    Constraints on clustering are the PPC test (--ppc 1e-3) and the requirement that each cluster contains at least one case and one control (--cc).

Merge in new genotype data

The files extra.ped and extra.map contain new SNP data on the same set of individuals. These are SNPs taken from the region around rs , the best SNP in the previous WGAS analysis. We first examine these SNPs by themselves, and then merge them into the SNPs in that region from the original WGAS dataset.

    Purpose:  Examine the new SNPs, testing for association stratified by population
    Command:  plink --file extra --mh --within pop.cov --out strat2
    Input:    extra.ped, extra.map   New followup SNP genotyping
              pop.cov                Population label
    Output:   strat2.cmh   CMH results for new genotypes
    Notes:    As evident in the result file strat2.cmh, there are some very strongly associated SNPs in this new set, in particular rs (with a P-value = ).

We next merge this new data with the old.

    Purpose:  Focus on region of association in WGAS data, and merge in new genotype data, creating a new fileset
    Command:  plink --bfile wgas3 --snp rs --window --merge extra.ped extra.map --make-bed --out followup
    Input:    wgas3.bed, wgas3.bim, wgas3.fam   QC+ binary fileset
              extra.ped, extra.map              New genotype data (same individuals)
    Output:   followup.bed, followup.bim, followup.fam   Merged fileset for region around top hit
    Notes:    The --snp and --window commands extract a particular region from wgas3 first, and then merge in the new genotype data in extra.ped.

We can check that the associations remain the same after merging these two filesets:

    Purpose:  Re-run association to check integrity of file
    Command:  plink --bfile followup --mh --within pop.cov --out followup-cmh
    Input:    followup.bed, followup.bim, followup.fam   Merged binary fileset for best region
    Output:   followup-cmh.cmh   CMH for top region in merged dataset
    Notes:    Now focusing on the top region, using --adjust is no longer appropriate.

Explore linkage disequilibrium

Further analysis indicates four other SNPs that are associated and in LD with the primary SNP rs :

    rs rs rs rs rs

Finally, we will extract just these five SNPs into another dataset.

    Purpose:  For convenience, focus on the 5 clumped SNPs for further analysis (and so create a new dataset containing just these)
    Command:  plink --bfile followup --snps rs ,rs ,rs ,rs ,rs --make-bed --out followup2
    Input:    followup.bed, followup.bim, followup.fam      Merged binary fileset for best region
    Output:   followup2.bed, followup2.bim, followup2.fam   Binary fileset of 5 SNPs in LD in top region
    Notes:    Note that --snps (versus --snp) can take a comma-delimited list of SNPs.

The pairwise LD (r-squared) between these SNPs can also be calculated using PLINK. By default, only SNP pairs with high LD are shown in the output file.

    Purpose:  Report pairwise LD (r-squared) for SNPs in this region
    Command:  plink --bfile followup2 --r2
    Input:    followup2.bed, followup2.bim, followup2.fam   Merged binary fileset for best region
    Output:   plink.ld   List of r-squared LD values (above threshold)
    Notes:    Add the --matrix option to get a 5 x 5 matrix of r-squared statistics.

3. ZEBRA FINCH DATA

Finally, we will examine some SNPs typed in a three-generation zebra finch pedigree. The idea is to check the SNPs and individuals (i.e. quality control), then select unlinked SNPs for analysis in programs such as COANCESTRY and COLONY to investigate the pedigree and relatedness structure of the data.

HINT! PLINK is designed for humans - so unless you tell it otherwise, it will assume your genome has 23 chromosomes, with chr23 being the X chromosome! Use the option --dog for up to 39 chromosomes. The zebra finch genome has around 30 chromosomes.

The zf.ped file is missing the column with the sex of the birds, so we use --allow-no-sex.
The zf.map file is in centimorgans not Morgans, so we use --cm.

Make bed file

    plink --file zf --cm --dog --make-bed --out zf

How many SNPs are there? How many individuals are there? How many founders are in the pedigree? Is there phenotype information in the file?

Test Hardy-Weinberg equilibrium

    plink --bfile zf --cm --dog --hardy --out zf_hwe

Do any SNPs fail the HWE test?

Test for parent-offspring mismatches

    plink --bfile zf --cm --dog --mendel --out zf_mendel

Do any individuals look like they're not related when we thought they were?

Test for parent-offspring mismatches again, this time in a file with an error!

    plink --file zf_with_ped_error --cm --dog --mendel --out zf_mendel1

What is the incorrect pedigree link?

Missing rates per individual and locus

    plink --bfile zf --cm --dog --all --missing --out zf_miss_stat

What is the maximum number of missing genotypes for an individual? What is the maximum number of missing genotypes for a SNP?

Calculate allele frequencies in founders

    plink --bfile zf --cm --dog --freq --out zf_freq_stat

What are the minimum and the maximum allele frequencies?

Delete individuals with more than 1% missing genotypes

    plink --bfile zf --cm --dog --mind 0.01 --out zf_highgeno

How many individuals have been removed?

Report pairwise LD (r-squared) for all SNPs

    plink --bfile zf --cm --dog --r2 --out zf_r2

How many SNPs are in perfect LD (r2 = 1)?

Create a LD pruned set of markers (first step)

    plink --bfile zf --cm --dog --indep-pairwise 50 5 0.9 --out zf_prune

How many SNPs are pruned from chromosome 1? How many SNPs are left in the dataset? Try to change the pruning parameters so that we end up with a dataset of around 550 SNPs in 'zf_prune.prune.in' (hint: currently we have removed any SNP that has r-squared > 0.9 with another SNP within a 50-SNP window; this window is shifted across the chromosome 5 SNPs at a time).

Calculate genome-wide IBS sharing based on pruned marker list

    plink --bfile zf --cm --dog --extract zf_prune.prune.in --genome --out zf_ibs

Extract the pruned SNPs into a file we might use in further analysis in other programs

    plink --bfile zf --allow-no-sex --cm --dog --extract zf_prune.prune.in --recode12 --out zf_pruned

Can you extract the file so it shows genotypes as "12" rather than "1 2"?
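To get a feel for how the three pruning parameters interact when tuning the exercise above, here is a toy version of window-based LD pruning. It is a simplified sketch and not PLINK's algorithm: r-squared is taken as the squared Pearson correlation of 0/1/2 allele counts (missing genotypes as None), and within each window the later SNP of any pair above the threshold is dropped. All names and data are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two allele-count vectors (None = missing)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return None  # a monomorphic SNP has no defined correlation
    return sxy * sxy / (sxx * syy)


def prune(genotypes, window=50, step=5, threshold=0.9):
    """Return indices of SNPs kept after a simplified sliding-window pruning.

    genotypes: one list of allele counts per SNP, in map order.
    """
    removed = set()
    n_snps = len(genotypes)
    start = 0
    while start < n_snps:
        in_window = [i for i in range(start, min(start + window, n_snps))
                     if i not in removed]
        for pos, i in enumerate(in_window):
            if i in removed:
                continue
            for j in in_window[pos + 1:]:
                if j in removed:
                    continue
                r2 = r_squared(genotypes[i], genotypes[j])
                if r2 is not None and r2 > threshold:
                    removed.add(j)  # simplification: always drop the later SNP
        start += step
    return [i for i in range(n_snps) if i not in removed]


if __name__ == "__main__":
    # Hypothetical allele counts for four SNPs in six birds.
    snps = [[0, 1, 2, 0, 1, 2],
            [0, 1, 2, 0, 1, 2],
            [2, 1, 0, 2, 1, 0],
            [0, 2, 1, 1, 0, 2]]
    print(prune(snps, window=50, step=5, threshold=0.9))
```

Lowering the threshold removes more SNPs, so moving it down from 0.9 is the natural first thing to try when aiming for a smaller pruned set.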
