xibd.r: A supplementary Tutorial to:


Combined Linkage and Mutation Detection of Targeted X Chromosomal Resequencing Data

Peter Krawitz 1,2,3,*, Sebastian Bauer 1,3,*, Christian Rödelsperger 1,2,3,*, Jochen Hecht 1,2, Andreas Tzschach 3, and Peter Robinson 1,2,3

1 Institute for Medical Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, Berlin, Germany
2 Berlin-Brandenburg Center for Regenerative Therapies (BCRT), Charité-Universitätsmedizin Berlin, Berlin, Germany
3 Max Planck Institute for Molecular Genetics, Ihnestrasse 73, Berlin, Germany

*These authors contributed equally.
Contact: christian.roedelsperger@charite.de, peter.krawitz@gmail, sebastian.bauer@charite
Corresponding author: peter.robinson@charite.de

Abstract

This tutorial explains how to use the R script xibd.r to perform IBD=1 analysis on X chromosomal sequences of three or more males affected with X-linked diseases, with the goal of identifying the chromosomal region for which each affected male has inherited the identical haplotype. Section 2 describes how to install R, the R script xibd.r, and the other required R packages. Section 3 describes the format of the data input files you will need to run the script. Section 4 explains how to use alternative data sources for different genome builds and recombination rates. Section 5 provides information about how to run the script. Finally, Section 6 explains how to create a plot of the results.

1 Introduction

This document explains how to use the R script xibd.r, which implements the algorithm described in the main manuscript. This document does not give many details of the algorithm itself; for those we refer to the main manuscript. xibd.r and all supplementary files and programs can be downloaded from our website. (NOTE TO REVIEWERS: The tutorial website is not currently visible from our main webpage; this will be updated following publication.)

2 Installation

R is a freely available programming language and environment that is primarily designed for statistical and scientific computing and graphics. Installation programs for all major operating systems can be downloaded from the R homepage. Many Linux distributions such as Debian allow users to install R directly from the package management system. Read the documentation for your operating system (Windows, Linux, Mac) at the CRAN site.

xibd.r makes use of the R package RHmm to perform Hidden Markov Model analysis. We have added some functionality to the basic RHmm package that is not yet available in the official package that can be downloaded from CRAN or Rforge. Therefore, download the version of RHmm from our website and install the package on your system as follows:

    $ R CMD INSTALL RHmm.tar.gz

Note that you may need root permissions to install this package, depending on the setup of your system. Alternatively, read the documentation for installing packages on your operating system. (NOTE TO REVIEWERS: Our modifications to RHmm are currently being tested and will be made part of the official package, probably prior to publication. This will make it easier to install the package. We will modify this tutorial accordingly.)

xibd.r additionally makes use of the R packages MASS, nlme, methods, RCurl, and bitops. If you get an error message saying that one of these packages needs to be installed, install it from the CRAN site.

The IBD=1 algorithm requires data about recombination fractions. As a convenience, xibd.r will automatically download these data from the UCSC website if Bioconductor [1] and the Bioconductor package rtracklayer are installed. If Bioconductor is not installed on your system, see the homepage of the Bioconductor project for installation instructions. The rtracklayer package allows annotation tracks to be transferred from the UCSC Genome Browser [2].
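The package requirements listed in this section can be checked from within R before running the script. The following is only a convenience sketch and not part of xibd.r; the variable names `required`, `installed`, and `not_installed` are chosen here for illustration.

```r
# Packages named in this section. RHmm has to be the modified version
# from the authors' website; the others are available from CRAN.
required <- c("RHmm", "MASS", "nlme", "methods", "RCurl", "bitops")

# requireNamespace() returns FALSE instead of raising an error
# when a package is not installed.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
not_installed <- required[!installed]

if (length(not_installed) > 0) {
  message("Missing packages: ", paste(not_installed, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

Any package reported as missing can then be installed as described above.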
To install rtracklayer into a Bioconductor installation, enter the following commands:

    > source("
    > biocLite("rtracklayer")

3 Input Data

Before you can use the xibd.r script, you will need to convert your X-chromosomal variant data into ibs format. How you do this will depend on the format of your data. For instance, it is reasonably simple to write a Perl script that uses the information from the Variant Call Format (VCF version 4.0) files of all affected males to generate the ibs file.

The ibs format records whether all sequenced individuals are observed to have the same genotype at a certain position t. We refer to the event of observing the same genotypes as IBS* and denote this in the ibs format with a 1 in the third column. If there are one or more divergent genotypes at position t, a 0 is written to denote ¬IBS* (not IBS*). We use the same symbols, IBS* and ¬IBS*, for the sets of all positions that are IBS* or not IBS*, respectively. There are two possibilities for n sequenced individuals:

- All n individuals have the same variant call at position t. Therefore, t ∈ IBS*.
- At least one analyzed individual has a variant call at position t, but fewer than n analyzed individuals have the variant called at this position. Therefore, t ∈ ¬IBS*.

Consider the following line from a VCF file:

    X G A 2553 DP=144;Dels=0.00 GT:DP:GL:GQ 1/1:144:

The following table explains the meaning of the items.

    Field    Example       Explanation
    CHROM    X             The chromosome
    POS                    Nucleotide position on the chromosome
    ID       .             Optionally, an ID such as an rs number from dbSNP
    REF      G             Base in the reference sequence
    ALT      A             Observed variant base
    QUAL     2553          Quality score
    FILTER                 Filter status (for example PASS)
    INFO     DP=144;D...   Additional information about the site, such as coverage
    FORMAT   GT:DP...      Layout of the per-sample fields
    Sample   1/1:...       Genotype information: 1/1 for homozygous, 0/1 for heterozygous

Thus, this line states that there has been a call on chromosome X at the position given in the POS column, and that the genotype is homozygous: 1/1, i.e., A/A (for males the genotype is actually hemizygous, A/-). The genotype information is not phased.

In order to generate an ibs file, we can use the information from the CHROM, POS, REF, and ALT columns together with the genotype (GT) field of the sample column. We will illustrate how this is done using an example in which we are sequencing four brothers. For each of them, we summarize the information in the form CHROM:POS:[REF ALT].

         brother1            brother2            brother3            brother4            IBS*
    (1)  chrx:148:[G A]      chrx:148:[G -]      chrx:148:[G -]      chrx:148:[G -]      0
    (2)  chrx:154:[A G]      chrx:154:[A G]      chrx:154:[A G]      chrx:154:[A -]      0
    (3)  chrx:354:[A -]      chrx:354:[A T]      chrx:354:[A T]      chrx:354:[A T]      0
    (4)  chrx:2356:[A T]     chrx:2356:[A T]     chrx:2356:[A T]     chrx:2356:[A T]     1
    (5)  chrx:12777:[C T]    chrx:12777:[C T]    chrx:12777:[C T]    chrx:12777:[C T]    1

xibd.r requires an input file with three columns: the first column indicates the chromosome, the second the position on that chromosome, and the third whether the locus was IBS* in all analyzed individuals. The columns should be separated by spaces or tabs.
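The rule for the third column can be sketched in R using the four-brothers example. This is a minimal illustration and not part of xibd.r: the helper `make_ibs`, the matrix `alt`, and the output file name are hypothetical, and `NA` stands for "no variant called" (shown as "-" in the table).

```r
# Hypothetical helper (not part of xibd.r): derive the IBS* column from
# per-individual ALT calls. Rows are positions, columns are individuals.
make_ibs <- function(chrom, pos, alt) {
  # A position is IBS* (1) only if every individual has a call
  # and all calls are the same variant; otherwise it is 0.
  ibs <- apply(alt, 1, function(a) {
    as.integer(!any(is.na(a)) && length(unique(a)) == 1)
  })
  data.frame(chrom = chrom, pos = pos, ibs = ibs, stringsAsFactors = FALSE)
}

# ALT calls of the four brothers at the five example positions.
alt <- rbind(
  c("A", NA,  NA,  NA),
  c("G", "G", "G", NA),
  c(NA,  "T", "T", "T"),
  c("T", "T", "T", "T"),
  c("T", "T", "T", "T")
)
ibs <- make_ibs("chrx", c(148L, 154L, 354L, 2356L, 12777L), alt)

# Write three whitespace-separated columns without header or quotes.
ibs_file <- file.path(tempdir(), "example.ibs")
write.table(ibs, file = ibs_file, quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

The resulting third column reproduces the IBS* column of the table. A real conversion would first have to merge the VCF files of all individuals by position, which is not shown here.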
We will refer to files with this format as ibs files in the rest of this document. For example:

    chrx
    chrx
    chrx
    chrx
    chrx
    chrx
    chrx

4 Allele Frequencies and Recombination Rates

The script requires position-specific variant frequency data and X-chromosomal recombination rates. We provide non-reference allele frequencies for common single nucleotide variants in hg18 coordinates, derived from the 1000 Genomes Project, as the additional file chrx_hg18_nra.txt on our website. The script will automatically load data on recombination rates for hg18 from the UCSC Genome Browser website [4]. These data come from the table recombRate, which describes the recombination rate in 1 Mb intervals based on the deCODE map. Note that if your data are not expressed in hg18 genomic coordinates, you will need to modify the script (or provide other files with recombination rates and allele frequencies) to get correct results.

Alternatively, you can specify a directory in which recombination rates from another source are deposited. This directory should contain files named "chr1.sm.map2", "chr2.sm.map2", and so on. We have not noticed substantial differences in the output of xibd.r depending on whether deCODE (as downloaded from the UCSC website) or Rutgers [3] recombination data are used.

5 Running xibd.r

xibd.r can be run from the command line (note that you may need to give the script file execute permission before invocation). By default, the results of the analysis are written to the standard output stream (stdout). xibd.r has three mandatory command-line arguments:

- The first argument is the name of the input file (for example test.ibs); the file must contain the input data formatted as described above.
- The second argument indicates the number of individuals that have been analyzed.
- The third argument indicates the number of maternal meioses that lie between these individuals.
For instance:

    ./xibd.r test.ibs 4 4 > test.ibd

This performs IBD=1 analysis for four brothers that are 4 meioses apart, and the result is written to a file called test.ibd. The result has the following format:

    chrom loc obs.ibs pred.ibd marg
    chrx
    chrx
    chrx
    chrx
    chrx
    chrx
    chrx
    chrx
    chrx

For each chromosomal position (chrom and loc), the file indicates whether the position was observed to be IBS* (obs.ibs; this information is taken from the input file) and whether it is predicted to be IBD=1 (pred.ibd, where 0 = predicted not IBD=1 and 1 = predicted IBD=1). The last column (marg) gives the marginal probability of being in the IBD=1 state. The column pred.ibd represents the result of the Viterbi algorithm, and the column marg represents the result of the forward-backward algorithm.

Note that the script expects the file chrx_hg18_nra.txt to be in the directory from which the script was started.
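The output table described in this section can be post-processed to list the contiguous regions that are predicted to be IBD=1. The following is a minimal sketch that assumes the five output columns named above and a single chromosome; the function name `ibd_segments` and the toy values are invented for illustration and are not produced by xibd.r.

```r
# Hypothetical helper: collapse runs of pred.ibd == 1 into segments.
# Assumes that all rows belong to the same chromosome.
ibd_segments <- function(res) {
  res <- res[order(res$loc), ]
  runs <- rle(res$pred.ibd)          # run-length encoding of the Viterbi path
  ends <- cumsum(runs$lengths)       # index of the last row of each run
  starts <- ends - runs$lengths + 1  # index of the first row of each run
  keep <- runs$values == 1
  data.frame(start = res$loc[starts[keep]],
             end = res$loc[ends[keep]],
             n.positions = runs$lengths[keep])
}

# Toy data in the output format (values invented for illustration).
res <- data.frame(
  chrom = "chrx",
  loc = c(100, 200, 300, 400, 500, 600),
  obs.ibs = c(0, 1, 1, 1, 0, 1),
  pred.ibd = c(0, 1, 1, 1, 0, 0),
  marg = c(0.02, 0.91, 0.97, 0.95, 0.30, 0.40)
)
segs <- ibd_segments(res)
print(segs)
```

With real results, the toy data frame would be replaced by `read.table("test.ibd", header = TRUE)`.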

6 Plotting the IBD=1 Predictions

The following R code reads the output file from the xibd.r script (which we have assumed is called test.ibd). In this example, the results for the X chromosome are extracted using the which command on line 6. Figure 1 shows the output of the script.

     1  dat <- read.table("test.ibd", header=TRUE)
     2  lod <- sapply(dat[, "marg"], function(x) { log10(x / (1 - x)) })
     3  df <- data.frame(chrom=dat[, "chrom"], pos=dat[, "loc"], lod=lod)
     4
     5  ## plot X chromosome
     6  ind <- which(df$chrom == "chrx")
     7  X <- df$pos[ind] / 1e6  ## Position in Mb
     8  Y <- df$lod[ind]
     9  plot(X, Y, type="l", xlab="Position on chromosome X (Mb)",
    10       ylab="lod", cex.axis=1.5, cex.lab=1.5)
    11  abline(h=0)

Figure 1: IBD=1 analysis of the test data set (y-axis: lod; x-axis: position on chromosome X in Mb). After running the command ./xibd.r test.ibs 4 4 > test.ibd on the test.ibs dataset from our website, you can display the results using the R commands that are explained in Section 6. If all has gone well, you will obtain a plot identical to the one shown in this figure.
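The lod value plotted in this section is the log10 odds of the marginal probability, log10(marg / (1 - marg)), which becomes infinite when marg is exactly 0 or 1. The following sketch shows one possible way to keep the values finite; the helper name `marg_to_lod` and the clamping constant `eps` are assumptions made here and are not part of xibd.r.

```r
# Hypothetical helper: log10 odds of the marginal IBD=1 probability.
# Probabilities are clamped to [eps, 1 - eps] so that 0 and 1 stay finite.
marg_to_lod <- function(p, eps = 1e-10) {
  p <- pmin(pmax(p, eps), 1 - eps)
  log10(p / (1 - p))
}

# Works on a whole column at once, so sapply() is not needed.
lod <- marg_to_lod(c(0, 0.5, 0.9, 1))
print(lod)
```

A probability of 0.5 maps to a lod of 0, which is the horizontal line drawn by abline(h=0) in the plot.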

References

[1] Robert C Gentleman, Vincent J Carey, Douglas M Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao Ge, Jeff Gentry, Kurt Hornik, Torsten Hothorn, Wolfgang Huber, Stefano Iacus, Rafael Irizarry, Friedrich Leisch, Cheng Li, Martin Maechler, Anthony J Rossini, Gunther Sawitzki, Colin Smith, Gordon Smyth, Luke Tierney, Jean Y H Yang, and Jianhua Zhang. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol, 5(10):R80.

[2] Michael Lawrence, Robert Gentleman, and Vincent Carey. rtracklayer: an R package for interfacing with genome browsers. Bioinformatics, 25(14), Jul.

[3] Tara C Matise, Fang Chen, Wenwei Chen, Francisco M De La Vega, Mark Hansen, Chunsheng He, Fiona C L Hyland, Giulia C Kennedy, Xiangyang Kong, Sarah S Murray, Janet S Ziegle, William C L Stewart, and Steven Buyske. A second-generation combined linkage physical map of the human genome. Genome Res, 17(12), Dec.

[4] Brooke Rhead, Donna Karolchik, Robert M Kuhn, Angie S Hinrichs, Ann S Zweig, Pauline A Fujita, Mark Diekhans, Kayla E Smith, Kate R Rosenbloom, Brian J Raney, Andy Pohl, Michael Pheasant, Laurence R Meyer, Katrina Learned, Fan Hsu, Jennifer Hillman-Jackson, Rachel A Harte, Belinda Giardine, Timothy R Dreszer, Hiram Clawson, Galt P Barber, David Haussler, and W. James Kent. The UCSC Genome Browser database: update. Nucleic Acids Res, 38(Database issue):D613-D619, Jan.
