Identifying splice junctions from RNA-Seq data
Joseph K. Pickrell
pickrell@uchicago.edu
October 4, 2010
Contents
1 Motivation
2 Identification of potential junction-spanning reads
3 Calling splice junctions from mapped reads
4 Combining reads and calculating the FDR
5 A complete example
6 Using jfinder on its own
1 Motivation

In an RNA-Seq experiment, it is often of interest to identify transcript isoforms de novo, without reference to known genome annotations. In this document, I will describe the usage of our scripts and software to perform an important part of this problem: the identification of sequencing reads which span exon-exon junctions. Our goal was to develop a procedure that is flexible enough to identify a large fraction of splice junctions and is also able to quantify our confidence in the reliability of the identified junctions. The software is all available at http://eqtl.uchicago.edu/rna_seq_data/Software/.

We assume that, as an initial step, all the reads have been mapped to the genome. Our procedure can roughly be divided into three steps:

1. Identification of potential junction-spanning reads
2. Calling precise splice junctions from mapped reads
3. Combining reads and assessment of a false discovery rate (FDR)

2 Identification of potential junction-spanning reads

First, we find all the reads which have not mapped to the genome, split each read in two, and map each end independently to the genome. We rely heavily on existing tools like bwa. One script which may be useful is sam_unmapped2fq_trim.py. This script takes a .sam.gz file as input and outputs a .fastq.gz file generated by trimming each unmapped read to N bases from one end.

USAGE: sam_unmapped2fq_trim.py [input.sam.gz] [output.fastq.gz] [f or l for "first" and "last"] [N]

For example,

sam_unmapped2fq_trim.py testin.sam.gz testout.fastq.gz f 20

will output the first 20 bases of unmapped reads in testin.sam.gz in fastq format. The .fastq.gz files can then be used as input to bwa or any other mapping tool.

3 Calling splice junctions from mapped reads

We provide a tool, jfinder, for identifying the precise splice junctions supported by a read after performing the above steps. First, we filter out reads where the two ends of the read come from different chromosomes or different strands, or map too far apart.
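As a rough illustration of this filter, the following Python sketch applies the criteria to one pair of separately mapped ends. It is not the code of the filtering script: the End tuple and the keep_pair function are hypothetical names, and the default thresholds (mapping quality of at least 10, ends within 100kb of each other) are the ones quoted in this section for that script.

```python
from typing import NamedTuple, Optional


class End(NamedTuple):
    """One separately mapped end of a read (hypothetical container)."""
    chrom: str
    strand: str  # "+" or "-"
    pos: int     # leftmost mapped position
    mapq: int    # mapping quality


def keep_pair(first: Optional[End], last: Optional[End],
              min_mapq: int = 10, max_dist: int = 100_000) -> bool:
    """Return True if the read passes the filter described in the text.

    An end that did not map is passed as None.
    """
    ends = [e for e in (first, last) if e is not None]
    # At least one end must map with quality >= min_mapq.
    if not any(e.mapq >= min_mapq for e in ends):
        return False
    # If both ends map, they must share chromosome and strand
    # and lie within max_dist bases of each other.
    if first is not None and last is not None:
        if first.chrom != last.chrom or first.strand != last.strand:
            return False
        if abs(first.pos - last.pos) > max_dist:
            return False
    return True
```

A read with only one mapped end is kept as long as that end maps with sufficient quality; such reads are the ones jfinder later extends across the junction.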
To perform this filtering, use filter_pair_sequences_sam.py.

USAGE: filter_pair_sequences_sam.py [first.sam.gz] [last.sam.gz] [notsplit.sam.gz] [output.gz]

This takes as input the output from step one, as well as the original data file, and outputs the reads
where at least one end of the read maps with a quality score of at least 10, and, if both ends map, they map to the same strand of the same chromosome and within 100kb of each other.

Now we can use jfinder on this output.

USAGE: jfinder -min [minimum intron length] -max [maximum intron length] -l [length of each end] -i [input file] -o [output file] -c [chromosome name] -cf [chromosome file (.fa.gz)]

This must be done on each chromosome separately. This program may be of interest on its own outside of the pipeline described here; the input and output files are described in Section 6.

4 Combining reads and calculating the FDR

We now have, for each read, the positions of the splice junctions compatible with that read. The next step is to combine these reads into a list of all the splice junctions in the data. The script we use for this is read2junc.py.

USAGE: read2junc.py [input file (.gz)] [output file]

We can now calculate the FDR of the junctions, using count_splice_sites.py.

USAGE: count_splice_sites.py [input file] [output file]

Printed to stdout is the number of GT-AG or GC-AG junctions, along with the FDR. The output file contains the following fields:

1. the positions of the first splice site consistent with the reads
2. the positions of the second splice site consistent with the reads
3. the number of reads spanning the junction
4. the splice site dinucleotides corresponding to each of the first splice sites
5. the splice site dinucleotides corresponding to each of the second splice sites
6. is the splice site consistent with being a GT-AG or GC-AG junction? (0: no, 1: GC-AG, 2: GT-AG)
7. is the splice site consistent with the control dinucleotides (GT-TC or GC-TC)? (0: no, 1: GC-TC, 2: GT-TC)
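To make the encoding of fields 6 and 7 concrete, here is a minimal Python sketch that classifies the dinucleotides of one junction. It is not taken from count_splice_sites.py: the function name classify is hypothetical, and the sketch assumes that the comma-separated lists in fields 4 and 5 are paired by index and that the highest matching code is reported when several candidate positions match.

```python
# Codes follow the field list above: 0 = no match, 1 = GC donor, 2 = GT donor.
CANONICAL = {("GC", "AG"): 1, ("GT", "AG"): 2}
CONTROL = {("GC", "TC"): 1, ("GT", "TC"): 2}


def classify(first_dinucs: str, second_dinucs: str, table: dict) -> int:
    """Return the highest code in `table` matched by any candidate site pair.

    Both arguments are comma-separated dinucleotide lists. Pairing them by
    index is an assumption of this sketch, not a documented behaviour.
    """
    # Drop empty strings left behind by a trailing comma.
    firsts = [d for d in first_dinucs.split(",") if d]
    seconds = [d for d in second_dinucs.split(",") if d]
    return max((table.get(pair, 0) for pair in zip(firsts, seconds)),
               default=0)
```

For instance, classify("GT,TA", "AG,GC", CANONICAL) returns 2, while the same call with CONTROL returns 0. Tallying both codes over all junctions gives the kind of counts from which an FDR can be estimated; the exact formula used by the script is not restated here.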
5 A complete example

Imagine we have mapped a lane of reads to the genome, and have the output in testlane.sam.gz. Now, how do we identify all the splice junctions on chromosome 1 supported by these reads? Below are all the commands in order.

sam_unmapped2fq_trim.py testlane.sam.gz testlane_trim1.fastq.gz f 20
sam_unmapped2fq_trim.py testlane.sam.gz testlane_trim2.fastq.gz l 20

...run bwa on this output, gzip the .sam output...

filter_pair_sequences_sam.py testlane_trim1.sam.gz testlane_trim2.sam.gz testlane.sam.gz testlane.filtered.paired.gz

jfinder -l 20 -min 30 -max 100000 -i testlane.filtered.paired.gz -o testlane.chr1.junctionreads.gz -c chr1 -cf chr1.fa.gz

gunzip -c testlane.chr1.junctionreads.gz | awk '{if ($5 > 9 && $7 > 9 && $13 < 3) print $0}' | gzip - > testlane.chr1.filtered.junctionreads.gz

(this filters out alignments with more than 2 mismatches and with fewer than 10 bases on either side of the splice junction)

read2junc.py testlane.chr1.filtered.junctionreads.gz chr1.juncs

count_splice_sites.py chr1.juncs chr1.juncs.wss

6 Using jfinder on its own

Once the two ends of a sequencing read have been mapped separately, jfinder can be used to find the splice junctions consistent with each read. As described above, usage is as follows:

USAGE: jfinder -min [minimum intron length] -max [maximum intron length] -l [length of each end] -i [input file] -o [output file] -c [chromosome name] -cf [chromosome file (.fa.gz)]

The input file is in the following format. On each line, separated by whitespace, are the following columns:

1. read name
2. sequence of read
3. strand
4. chromosome
5. position of first part of read (or NA)
6. position of second part of read (or NA)

For example:

HWI-EAS134:6:1:0:1724#0 CTTACTCACCCCAGCATGGAAACTACCACGAGGAG + chr8 NA 145137857
HWI-EAS134:6:1:0:1633#0 TGCACCGGTGCAGCCTCCCATGTCGCAGGCGGAGG + chrx NA 1497876

The output file contains the following columns (one for each read where a junction was found):

1. read name
2. chromosome
3. sequence of read
4. start of alignment
5. length of first part of alignment after extension (note that for reads where both ends map, this will be the length of the aligned fragment)
6. end of alignment
7. length of second part of alignment after extension (note that for reads where both ends map, this will be the length of the aligned fragment)
8. possible positions of the first splice site (the first base of the intron), comma-separated
9. possible positions of the second splice site (the first base of the exon), comma-separated
10. corresponding intronic dinucleotides for each possible first splice site, comma-separated
11. corresponding intronic dinucleotides for each possible second splice site, comma-separated
12. number of possible splice sites
13. number of mismatches to the genome
14. did both ends of the original read map to the genome? (both = yes, one = no)

For example, one such line might look like this:

HWI-EAS134:6:1:245:1272#0 chr21 CCGACGTGCACCTTGATGAAGTAGTTTGTCCCCGC 44018629 9 44018991 26 44018638,44018639,44018640,44018641,44018642, 44018965,44018966,44018967,44018968,44018969, AC,CC,CT,TG,GG, CT,TA,AC,CC,CT, 5 0 one
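For downstream work in Python, a line of jfinder output can be split into the 14 columns listed above. The sketch below is an illustration rather than part of the distributed software: the JunctionRead class and the parse_line and passes_filter functions are hypothetical names. The filter mirrors the awk condition in the complete example (columns 5 and 7 greater than 9, column 13 less than 3).

```python
from typing import List, NamedTuple


class JunctionRead(NamedTuple):
    """One line of jfinder output, in column order (hypothetical container)."""
    name: str
    chrom: str
    seq: str
    start: int
    len_first: int
    end: int
    len_second: int
    first_sites: List[int]
    second_sites: List[int]
    first_dinucs: List[str]
    second_dinucs: List[str]
    n_sites: int
    mismatches: int
    both_mapped: bool


def _split(field: str) -> List[str]:
    # Lists may carry a trailing comma, as in the example line; drop empties.
    return [x for x in field.split(",") if x]


def parse_line(line: str) -> JunctionRead:
    """Split one whitespace-separated output line into typed columns."""
    f = line.split()
    if len(f) != 14:
        raise ValueError(f"expected 14 columns, got {len(f)}")
    return JunctionRead(
        f[0], f[1], f[2], int(f[3]), int(f[4]), int(f[5]), int(f[6]),
        [int(x) for x in _split(f[7])], [int(x) for x in _split(f[8])],
        _split(f[9]), _split(f[10]), int(f[11]), int(f[12]), f[13] == "both",
    )


def passes_filter(r: JunctionRead, min_side: int = 10,
                  max_mismatches: int = 2) -> bool:
    """Same condition as the awk filter: $5 > 9 && $7 > 9 && $13 < 3."""
    return (r.len_first >= min_side and r.len_second >= min_side
            and r.mismatches <= max_mismatches)
```

Applied to the example line above, parse_line yields five candidate positions for each splice site, and passes_filter returns False because only 9 bases align on the first side of the junction.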