Unix tutorial, tome 5: deep-sequencing data analysis


by Hervé
December 8, 2008

Contents
1 Input files
2 Data extraction
2.1 Overview, implicit assumptions
2.2 Usage
2.3 Screenshots
3 Things to know

1 Input files

The primary output of the Solexa sequencer is a series of images of the flow cell (one for each position of the cDNA, from 1 to 36, and for each nucleotide identity). People at the deep-sequencing core facility use these images to extract several kinds of files: sequence files (suffix: seq.txt; these are the files where we will find the RNA or DNA sequences), as well as several other files that contain information on the quality of the sequencing reaction (suffixes: sig2.txt, prb.txt and qhg.txt).

Once Ellen Kittler has finished preparing these files, she packs them into compressed archives (suffix: .tar.gz; she usually makes one archive per sample) and puts them in this directory on Binar: /share/nemo/kittlere/deep SEQ DATA/ZAMORE/ (actually, this is where she puts the data for our lab; in addition to the ZAMORE directory, she also has a RANDO directory, a MELLO directory, and so on). She then sends an email to the person who submitted the sample, to let him/her know that the analysis is done.

It is particularly important to back up these files as soon as possible: due to their size (one individual sample usually generates several gigabytes of seq.txt, sig2.txt, prb.txt and qhg.txt files), she cannot store them for more than two weeks. After this delay, she has to delete them (she usually sends an email before actually deleting them).

Of course, we need to keep the seq.txt files, in order to be able to re-extract the sequences if needed. But we also have to keep the sig2.txt and prb.txt files, to submit them to the NCBI's Short Read Archive (SRA). This database stores deep-sequencing raw data, and most journals will ask for an SRA accession number every time you want to publish deep-sequencing data. The reason is that it is important to provide all the experimental results, so that others can re-analyze the data; yet deep-sequencing datasets are too large to be published, even as supplemental data. The aim of this public database is to provide long-term storage for these datasets, accessible to anybody. The guidelines for SRA data submission explain that several types of raw data files are accepted (including the flow cell images themselves); among those, the easiest for us to get are the seq.txt, sig2.txt and prb.txt files: Ellen Kittler prepares them for us.

The .tar.gz archives contain many seq.txt, sig2.txt, ... files: each one has a name looking like s_5_0013_seq.txt. All these file names start with s_; the number after this s_ (here: 5) is the lane number in the flow cell (in other words, it is the identity of the sample: we usually load one sample per lane; as there are 8 lanes per flow cell, this number will always be between 1 and 8, unless one day Illumina changes the format of their flow cells). The four-digit number is the identity of the tile in the lane (each lane is actually not read in one shot: it is divided in 330 squares, called tiles; each tile contains many sequenced clusters, and each cluster derives from a single DNA molecule, captured on the flow cell surface, then PCR-amplified). So s_5_0013_seq.txt contains the sequence information of all the clusters in tile #13 of lane #5.

When you open a seq.txt file (for example, with less), you should see something like:

GTCCGACGATCTGTCAGTTTGTCAAATACCCCACTG
GTTAAATTATAGGCTAAATCCTATATAAAACTGTAG
GTCCGACGATCAGCAGCATTGTACAGGGCTATGACT
TGAGG...G
GTCAGTTTGTCAAATACCCCAACTGTAGGCACCATC

Each line in this file corresponds to one cluster in that tile (hence, to one DNA molecule in your library). In addition to the cluster sequence (at the end of each line), it also gives the lane number (first number of each line), the tile number (second number), and the position of that cluster in the tile (x and y, third and fourth numbers); all these fields are separated by tabs. Some sequences contain dots: these are ambiguous nucleotides (which other programs usually represent as N's).

2 Data extraction

2.1 Overview, implicit assumptions

The program I wrote (Rajkumari.sh) extracts the unambiguous sequences from the seq.txt files, finds the ones where the 5' end of the 3' adapter can be found (we usually ask for the first 7 nt of the 3' adapter to be perfectly read), then extracts the sequence upstream of the 3' adapter: these sequences (inserts) are the sequences of small RNAs from your original sample.

The program then demultiplexes these sequences: if a given sequence (let's say, the let-7 sequence) has been read 1,000 times, it will be present only once in the demultiplexed sequence file, with a tag indicating that it was read 1,000 times. This step really speeds up the rest of the analysis: for example, instead of mapping that sequence 1,000 times on the genome, it will be mapped only once (of course, as these 1,000 reads are identical, their genomic hits will be the same).

Then the program selects those inserts that map perfectly on the fly genome [2]. There is an assumption here: we consider that only the genome-matching sequences are of interest. This is not always the case: if you are interested in untemplated additions on miRNAs, small RNA editing, small RNAs coming from exon-exon junctions, ..., you will have to specifically skip that step.
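To make the extraction and demultiplexing steps concrete, here is a minimal sketch in shell. It is not taken from Rajkumari.sh: the function name extract_inserts is made up for this illustration, and it assumes that the leftmost occurrence of the first 7 nt of the adapter marks the end of the insert, and that reads starting directly with the adapter (empty inserts) are dropped.

```shell
# Hypothetical helper, not part of Rajkumari.sh.
# Usage: extract_inserts ADAPTER < some_seq.txt
# Input: seq.txt lines (lane, tile, x, y, sequence; tab-separated).
# Output: one line per distinct insert, as "read count<TAB>insert".
extract_inserts() {
    awk -F '\t' -v adapter="$1" '
        BEGIN { prefix = substr(adapter, 1, 7) }   # first 7 nt of the adapter
        index($5, ".") > 0 { next }                # skip reads with ambiguous nucleotides
        {
            pos = index($5, prefix)                # leftmost match of the adapter prefix
            if (pos > 1) print substr($5, 1, pos - 1)
        }' |
    sort | uniq -c | awk '{ print $1 "\t" $2 }'    # collapse identical inserts, keep the count
}
```

Called as extract_inserts CTGTAGGCACCATCAAT on a seq.txt file, it prints each distinct insert once, preceded by the number of times it was read. The sort | uniq -c idiom is what makes each sequence appear only once.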
The next step is the elimination of reads that match abundant non-coding RNAs (by abundant non-coding RNAs, I mean the ribosomal RNAs, tRNAs, snRNAs and snoRNAs, as well as the rRNA variants most frequently found in our previous experiments). Once again, this assumes that you are not interested in these sequences.

Then the program identifies all the reads that map perfectly on known fly pre-miRNAs (actually: on the miRBase-annotated hairpins, extended by 10 nt on each end, to recognize the processing variants that extend beyond the extremities of the miRBase-annotated hairpins). Depending on the experiment you are doing, you might want to specifically select the pre-miRNA-matching reads (for example, if you're looking at miRNAs), or to exclude them (for example, if you're looking at piRNAs). The program therefore extracts the miRNA- and miRNA*-matching reads (and stores them in two files per sample, called heterogeneously-mir-matching reads in [...].dat and heterogeneously-mirstar-matching reads in [...].dat), and it generates fasta files containing all the genome-matching, non-(abundant non-coding RNA)-matching, non-pre-miRNA-matching sequences (one fasta file per sample; these files are called non pre-mirna-matching non ncrna-matching genome matching [...].fa).

Finally, the program counts how many inserts, and how many unique sequences, were selected at each step, and generates a .csv file [3], called Extraction statistics [...].csv, with all these numbers.

[2] If you want to map your sequences on another genome, you will have to patch the program, and make sure that the genome of interest has been installed on Binar: check that with David Lapointe.
[3] You can open it with Excel.
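The two numbers reported at each step (inserts and unique sequences) can be recomputed from any of the output fasta files. The sketch below assumes the title-line convention described in section 3 (a multiplicity=N tag on each title line); the function name fasta_statistics and the layout of the printed row are made up for this illustration, and are not necessarily those of Extraction statistics [...].csv.

```shell
# Hypothetical helper, not part of Rajkumari.sh.
# Usage: fasta_statistics LABEL < some_output.fa
# Prints one CSV row: LABEL,number of inserts,number of unique sequences.
fasta_statistics() {
    awk -v label="$1" '
        /^>/ {
            unique++                                   # one title line per unique sequence
            if (match($0, /multiplicity=[0-9]+/))      # 13 = length of "multiplicity="
                reads += substr($0, RSTART + 13, RLENGTH - 13)
        }
        END { printf "%s,%d,%d\n", label, reads, unique }'
}
```

Running it once per step and appending the rows to a single file gives a small table that Excel can open, comparable in spirit to the one the program generates.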

The program also plots the size distribution histograms of the genome-matching, non-(abundant non-coding RNA)-matching reads, in two histograms per sample: one gives the number of reads for each size class, and the other gives the number of unique sequences for each size class. These files are called size distribution non ncrna-matching genome matching [...].eps and size distribution unique sequences non ncrna-matching genome matching [...].eps. It plots the same two histograms for the same RNAs, excluding the pre-miRNA-matching ones: these files are called size distribution non pre-mirna-matching non ncrna-matching genome matching [...].eps and size distribution unique sequences non pre-mirna-matching non ncrna-matching genome matching [...].eps.

2.2 Usage

Run the program on a multi-processor cluster (like Binar), where the Drosophila melanogaster genome has been downloaded and pre-processed for Eland (Eland is a genome-matching program written by Solexa; it maps small reads on genomes, allowing up to two mismatches, and, for the reads that map uniquely on the genome, it gives the location of their genomic hit). Open the compressed archive Rajkumari.tar.bz2 (by typing tar -xjf Rajkumari.tar.bz2 in the directory where you saved the archive), then type ./Rajkumari.sh to run the program. You will have to answer a few questions, then the program will run (for a few hours to a few days, depending on the type of analysis you asked for).

2.3 Screenshots

Figure 1: Decompressing the archive. My directory Tutorial contained just the compressed archive; tar -xjf (see man tar for the explanation of these options) opened the compressed archive, and extracted all the files it contained (there are 13 of them).

Figure 2: Starting the program. This screenshot was taken just before I pressed Enter.

Figure 3: Starting the program. This screenshot was taken just after I pressed Enter.

Figure 4: Name of the sample series. The name you enter (here, I chose 28OCT08) will be included in the names of all the files that the program generates, so they won't get mixed up with your other data sets. I recommend using the date of the Solexa run as a name for the series: the program will then be able to locate the data files by itself (Ellen Kittler always names the data directories with a character string that starts with the Solexa run date). The date must be written in the format DDMMMYY, where MMM is the month (three-letter code, in capitals). If you give another name to your series, the program won't be able to locate your data files, and you will have to enter their location by hand. Similarly, if you don't want to analyze all the data files that it found, or if you want to analyze them together with other ones, you will have to answer n to the question "Are they the ones you want to analyze (y/n)?", then enter their location by hand. Such a location could be /share/nemo/kittlere/deep SEQ DATA/ZAMORE/19MAY08 FC20AVDAA DATA/ SEQs FC20AVDAA Lane.8.tar.gz; if there are several locations, separate them with a space when you type them.
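If you want to check a series name before starting a long run, the DDMMMYY format can be tested with a regular expression. This is only a sketch: the function name is_run_date is made up, it is not how Rajkumari.sh itself locates the data, and it checks the shape of the name only (it would accept an impossible date such as 31FEB08).

```shell
# Hypothetical helper, not part of Rajkumari.sh.
# Succeeds (exit status 0) if the argument has the DDMMMYY shape, e.g. 28OCT08.
is_run_date() {
    printf '%s\n' "$1" |
    grep -Eq '^(0[1-9]|[12][0-9]|3[01])(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)[0-9]{2}$'
}
```

For example, is_run_date 28OCT08 && echo "looks like a run date" prints the message, while a lower-case month (28oct08) is rejected, since the month code must be in capitals.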

Figure 5: Type of analysis. The first type (choose it by entering 1) processes the data as described in subsection 2.1. The final outputs of this analysis are: a series of fasta files (containing the cluster sequences, the extracted insert sequences, the genome-matching insert sequences, the non-(abundant non-coding RNA)-matching, genome-matching sequences, and the non-pre-miRNA-matching, non-(abundant non-coding RNA)-matching, genome-matching sequences), a .csv file (giving the numbers of inserts and unique sequences in each of these subsets), and a series of histograms, in the .eps format, showing the size distributions of the last two subsets. One series of fasta and .eps files is generated for each sample in the series (here, there are 4 samples: EPminus, which was loaded on lane 8; EPplus, on lane 7; GAminus, on lane 6; and GAplus, on lane 5), but a single .csv file is generated, containing the statistics for every sample.

The second type of analysis (choose it by typing 2) will do the same, but then it will annotate the non-pre-miRNA-matching, non-(abundant non-coding RNA)-matching, genome-matching sequences. The output of that annotation is a series of .csv files (one per sample) describing the genomic hits of every sequence: on what chromosomes and at what positions it maps, whether the hit falls on an annotated gene or on an annotated transposon, what the closest gene is (and its distance) if it doesn't, and what the sequence of the genomic context of that hit is (200 nt, centered on the first nucleotide of the read).
The third type of analysis (choose it by typing 3) will do the same, but then it will identify the clusters of small RNAs (excluding the ones that match pre-miRNAs) on the genome, using the definition given by [Brennecke et al., 2007]. The output of that clustering analysis is a .csv file containing the genomic coordinates of the identified clusters, and the number of small RNA hits they contain (including or excluding the sequences that also map elsewhere on the genome).
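As an illustration of one column of the annotation produced by the second type of analysis, the closest gene and its distance, here is a minimal sketch. It is an assumption-laden toy, not the method used by Rajkumari.sh: the function name closest_gene and the three-column gene table (name, start, end, all on the same chromosome as the hit) are made up, strands are ignored, and a hit that falls inside a gene is given a distance of 0.

```shell
# Hypothetical helper, not part of Rajkumari.sh.
# Usage: closest_gene POSITION < genes.txt
# genes.txt (made-up format): one gene per line, "name start end".
# Output: "gene name<TAB>distance" for the gene closest to POSITION.
closest_gene() {
    awk -v pos="$1" '
        BEGIN { pos += 0 }                             # force a numeric comparison
        {
            start = $2 + 0; end = $3 + 0
            if (pos < start)    d = start - pos        # hit lies before the gene
            else if (pos > end) d = pos - end          # hit lies after the gene
            else                d = 0                  # hit falls on the gene
            if (NR == 1 || d < best) { best = d; name = $1 }
        }
        END { if (NR > 0) print name "\t" best }'
}
```

With a table containing geneA (100 to 200) and geneB (500 to 900), a hit at position 260 is reported as 60 nt away from geneA.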


Figure 6: 3' adapter sequence. Here you have to enter the sequence of the 3' adapter you used to prepare your libraries, omitting the 5' adenylate (for convenience, the program displays the sequences of the two adapters we commonly use in the lab: IDT's linker-1, CTGTAGGCACCATCAAT, and Chengjian's adapter, TCGTATGCCGTCTTCTGCTTG).

Figure 7: Automatic email. As the analysis can be quite long, you may want to receive an automatic email telling you that it's done (type 2 or 3 if you want to receive one).


Figure 8: Email address. If you typed 2 or 3 at the last question, you now have to enter your email address (if you want to enter several email addresses, separate them with spaces).

Figure 9: The program is now running. It will periodically indicate what it is doing (here, it is starting to extract small RNA sequences).

3 Things to know

1. Output files are named after the sample lane number and the series name. For example, sequences from the sample loaded on lane #8 of the 28OCT08 series will be called s 8 28OCT08.fa (this file contains the sequences from all the unambiguously read clusters); inserts s 8 28OCT08.fa (inserts from the cluster sequences where the 3' adapter was found); reduced inserts s 8 28OCT08.fa (demultiplexed version of inserts s 8 28OCT08.fa: see subsection 2.1); genome matching s 8 28OCT08.fa (genome-matching inserts); non ncrna-matching genome matching s 8 28OCT08.fa (among the genome-matching inserts, those that do not match abundant non-coding RNAs); and non pre-mirna-matching non ncrna-matching genome matching s 8 28OCT08.fa (among those, the ones that do not match known Drosophila pre-miRNAs).

2. On my account on Binar, I found it more convenient to perform the small RNA annotation in a separate directory (namely, ~seitzh/deepseq/genomematching/), while I usually run Rajkumari.sh from ~seitzh/deepseq/ itself. If you want to organize your directories differently, you will have to patch lines 105 and 106 of Rajkumari.sh.

3. The Drosophila melanogaster genome is updated from time to time: either only the genome annotation is modified (meaning that the genome sequence does not change), or the genome assembly itself is updated. I am currently using version 5.5 of FlyBase's gene annotations (it is based on version 5 of the genome assembly). For a more up-to-date annotation, you will have to download the annotation files from FlyBase into your GenomeMatching/ directory. When the genome assembly changes (the next version will be #6, and FlyBase's annotations will be numbered 6.1, then 6.2, etc.), you will have to ask David Lapointe to update /share/apps/genomes/dm5.5 on Binar.

4. The list of microRNAs is also periodically updated; right now, I am using miRBase version 10.1 (dated December 19, 2007).
The current release of miRBase is 12.0 (dated October 29, 2008), but the updates did not affect D. melanogaster sequences (they are still the same as in December 2007). It is important to use the latest update of miRBase when you annotate pre-miRNA-matching reads: if one day the list of D. melanogaster miRNAs changes, you will have to update the extended pre-miRNA sequences (they are in the file Fused extended hairpinsdec07.fa: this is a fasta file, whose sequences must fit on single lines; it contains the sequences of miRBase's hairpins, extended by 10 nt of genomic flanking sequence on each side).

5. Avoid running two (or more) sessions of Rajkumari.sh simultaneously. Most of the tasks are actually split between several nodes of the cluster; the master script (Rajkumari.sh) then periodically checks the progress of each node, and it proceeds to the next step once every node has completed its job. If two sessions of Rajkumari.sh run at the same time, they will use many nodes (by default, each one uses 60 nodes out of 130, but you're not the only user on the machine...), and each Rajkumari.sh will have to wait until all 120 nodes are done (meaning that the fastest analysis will have to wait until the slowest one is done before proceeding to the next step).

6. In the output fasta files, sequences are demultiplexed (see subsection 2.1), and the title lines (starting with >) give the abundance of the corresponding sequences (after the keyword multiplicity=). For example, these lines:

>inserts s 8 28OCT08 4295 multiplicity=21
AAAATACCTAAACGTCAGCGACG

mean that sequence AAAATACCTAAACGTCAGCGACG has been read 21 times in that sample (the name of the sample is given at the beginning of the title line: this is sample #8 of the 28OCT08 series; the identifier of that particular sequence in this sample is 4295).

7. If you want to re-plot the size distribution histograms (for example, with Igor), the data used to generate these histograms is in the files size distribution non ncrna-matching genome matching [...].dat. These are space-delimited text files: you can open them with Excel or, alternatively, convert them to .csv files with sed: sed 's/ /,/g' name_of_file.

References

[Brennecke et al., 2007] Brennecke, J., Aravin, A. A., Stark, A., Dus, M., Kellis, M., Sachidanandam, R. and Hannon, G. J. (2007). Cell 128,
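The title-line convention of item 6 and the size distributions of item 7 can be tied together with a short sketch that recomputes, from an output fasta file, the number of reads and of unique sequences in each size class. This is an illustration only: the function name size_distribution is made up, it assumes one sequence per line after each title line, and the column layout it prints is not necessarily that of the .dat files generated by the program.

```shell
# Hypothetical helper, not part of Rajkumari.sh.
# Usage: size_distribution < some_output.fa
# Output: one line per size class, "length reads unique_sequences", sorted by length.
size_distribution() {
    awk '
        /^>/ {
            n = 1                                      # default if no multiplicity tag
            if (match($0, /multiplicity=[0-9]+/))      # 13 = length of "multiplicity="
                n = substr($0, RSTART + 13, RLENGTH - 13)
            next
        }
        { reads[length($0)] += n; unique[length($0)]++ }
        END { for (len in reads) print len, reads[len], unique[len] }' |
    sort -n
}
```

The output is space-delimited, so the sed command of item 7 converts it to a .csv file in the same way.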
