By Ludovic Duvaux (27 November 2013)


Array of jobs using SGE - an example using stampy, a mapping software.
Running java applications on the cluster - merge sam files using the Picard tools

The idea
========
One may have many inputs (samples, files...) to process. Sometimes the different parts of a single file can also be processed independently and in parallel on several cores (the two situations are not mutually exclusive). In both cases (or a mix of both), creating an array of jobs using SGE can be of great help! For instance, most recent mapping programs can process only a fraction of a fastq input file (e.g. only 1/10th of your file will be mapped on your reference genome), and one can have many samples to map on a reference genome! For the current course, we'll do that using some pea aphid data.

I) Preliminary steps
====================

I.1) Open an interactive session on the cluster
===============================================
Very little software and few libraries are installed on the head node. Therefore you may run into severe problems if you attempt to install anything 'under your account or otherwise' while working on the cluster head node. Open an interactive session on a worker node instead:

[MyLogin@cluster ~]$ qrsh
[MyLogin@node15 ~]$

I.2) create a new folder & fetch the training files
===================================================
To do so:

ln -s /data/mylogin/ ./data # create a symbolic link toward your data folder in the current folder

If the above doesn't work, go first to your data folder, then come back, in order to trigger access to your data folder:

cd /data/mylogin
cd
ln -s /data/mylogin ./data

Then:

cd data

mkdir -p ArrayJobs_Stampy
cp /usr/local/extras/genomics/hpc_course/stampy_example/script_files/* ArrayJobs_Stampy # fetch all the files needed for this training
ll ArrayJobs_Stampy

I.3) Check if you can run stampy
================================
Try to call stampy's manual. From anywhere, type:

/usr/local/extras/genomics/applications/stampy/1.0.22/stampy.py

You should obtain something like:

[bo4cm17@testnode01 ~]$ /usr/local/extras/genomics/applications/stampy/1.0.22/stampy.py
stampy v (r1848), <gerton.lunter@well.ox.ac.uk>

Usage: /usr/local/extras/genomics/applications/stampy/1.0.22/stampy.py [options] [.fa files]

Option summary (--help for all):

Command options
 -G PREFIX file1.fa [...]          Build genome index PREFIX.stidx from fasta file(s) on command line
 -H PREFIX                         Build hash PREFIX.sthash
 -M FILE[,FILE]                    Map fastq/fasta/bam file(s)
 -A FILE                           Convert qualities; strip adapters

Mapping/output options
 -g PREFIX                         Use genome index file PREFIX.stidx
 -h PREFIX                         Use hash file PREFIX.sthash
 -o FILE                           Write mapping output to FILE [stdout]
 --readgroup=id:id,tag:value,...   Set read-group tags (ID,SM,LB,DS,PU,PI,CN,DT,PL) (SAM format)
 --solexa, --solexaold, --sanger   Solexa read qualities (@-based); pre-v1.3 Solexa; and Sanger (!-based, default)
 --substitutionrate=f              Set substitution rate for mapping and simulation [0.001]
 --gapopen=n                       Gap open penalty (phred score) [40]
 --gapextend=n                     Gap extension penalty (phred score) [3]
 --bwaoptions=opts                 Options and <prefix> for BWA pre-mapper (quote multiple options)
 --bwamaxmismatch=n                Max number of mismatches for BWA maps; -1=auto [-1]
 --bwatmpdir=s                     Set directory for BWA temporary files
 --bwa=f                           Set BWA executable [default: bwa]
 --bwamark                         Include/mark BWA-mapped reads with XP:Z:BWA tag (produces more output lines)

General options
 --help                            Full help
 -v N                              Set verbosity level (0-3) [2]

If not, there is a problem in the installation. Note that stampy uses Python. See the file "HowToInstallOnIceberg_Python2.x_stampy.txt" on the cluster to install it if needed.

For greater simplicity, create a symbolic link to this executable in your bin folder (then you won't have to type "/usr/local/extras/genomics/applications/stampy/1.0.22/" before "stampy.py" each time afterwards):

mkdir ~/bin # ~/bin is a special folder already included in your "PATH" even if it doesn't exist!
ln -s /usr/local/extras/genomics/applications/stampy/1.0.22/stampy.py ~/bin # note the difference of behaviour when creating a symbolic link toward a folder or a file (see above)
ll ~/bin

Try to run stampy again:

stampy.py

To get more extensive help on stampy, just type:

stampy.py --help

I.4) prepare inputs & folders
=============================
cd ~/data # go back to our data directory if needed
mkdir -p ArrayJobs_Stampy/RefGenom
mkdir -p ArrayJobs_Stampy/mapping_results
mkdir -p ArrayJobs_Stampy/log_files
cd ArrayJobs_Stampy
cp -v /usr/local/extras/genomics/applications/stampy/1.0.22/refgenom/acypi_assembly2_rehead-noblank.stidx ./RefGenom # copy stampy's genome index to your folder
cp -v /usr/local/extras/genomics/applications/stampy/1.0.22/refgenom/acypi_assembly2_rehead-noblank.sthash ./RefGenom # stampy's genome hash table

II) run complementary mapping jobs in parallel
==============================================

II.1) prepare the bash script to run the jobs using SGE
=======================================================
The first step is to prepare a bash script to run our jobs via the SGE scheduler. The script declares the amount of resources needed and the estimated run time of the job (important to go on the priority queue...). For our training, this file already exists: see RunArrayOfstampyJobs.sh

II.2) prepare a text file with the command line options for stampy
==================================================================
One great advantage of command line tools is that many different specific command lines can be prepared in order to deal with different sample requirements. In our case, these differences are:
- sample names (fastq files)
- library names
- fraction of the file to process
- fastq format (VERY IMPORTANT to run stampy properly)

The best thing to do before running our analysis is thus to record all these differences in a file.
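As an illustration of this idea, the sketch below shows one possible layout for such a file and how each task of an SGE array job could pick its own line from it. The file layout (one line per job: fastq file, library name, fraction, quality format), the example values and the helper name pick_options_line are assumptions made for this example only; they do not describe the contents of the course file. SGE sets the SGE_TASK_ID variable for each task of an array job (e.g. one submitted with "qsub -t 1-3"); here it defaults to 1 so that the sketch also runs outside the scheduler.

```shell
# Hypothetical options file: one line per array task (layout assumed for illustration).
opts_file=$(mktemp)
cat > "$opts_file" <<'EOF'
sampleA.fastq libA 1/2 sanger
sampleA.fastq libA 2/2 sanger
sampleB.fastq libB 1/1 solexa
EOF

# Print line number $2 of file $1, i.e. the options for one task.
pick_options_line() {
    sed -n "${2}p" "$1"
}

# Inside an array job, SGE exports SGE_TASK_ID (1, 2, 3 for "qsub -t 1-3").
# Default to task 1 when the variable is not set.
task_id=${SGE_TASK_ID:-1}
task_opts=$(pick_options_line "$opts_file" "$task_id")

# Split the selected line into named fields (plain word splitting on spaces).
set -- $task_opts
fastq=$1
library=$2
fraction=$3
qualfmt=$4

echo "task $task_id: fastq=$fastq library=$library fraction=$fraction format=$qualfmt"
```

Keeping one line per task means the number of lines in the file is also the number of tasks to request from the scheduler.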

For our training, this file already exists: "01_stampyCommandOptions.txt".

II.3) write a script that will prepare specific command lines
=============================================================
We then have to write a script that will read and interpret the command line option file in order to prepare the specific command line of each sample/job in the array. The choice of the scripting language mainly depends on our own preferences (perl, python, R...), even though some are probably more efficient than others. For our training, I wrote such a file in 'R': "02_Runstampy.R".

II.4) run the bash script
=========================
To do so, just type in the shell:

qsub RunArrayOfStampyJobs.sh # run the bash script
qstat | grep bo4cm # check that the jobs are on the queue
qstat -u MyLogin

When the jobs start, they create an SGE log file with a name similar to "RunArrayOfstampyJobs.sh.o ", where the first and second numbers are the job and task IDs respectively. In that log, you will find all the information that you asked the bash script to print. With my current implementation, the time spent running the job is indicated at the end. This is pretty useful to schedule your future jobs on the cluster. You'll also find the R and stampy log files in the folder "~/data/ArrayJobs_Stampy/log_files" and the result files in "~/data/ArrayJobs_Stampy/mapping_results".

III) Running java applications on the cluster - merge sam files using the Picard tools
======================================================================================
The problem now is that, by using different cores to map the reads on the reference genome, we have created several sam files for the same sample, so we want to merge them for all the subsequent analyses. To do so, we can use the jar file "MergeSamFiles.jar" of the Picard tools. The great advantage of java programs is that they are portable across operating systems (i.e. once the code has been written for one OS (e.g. Windows), you don't need to modify it for another OS (e.g. Linux)).

III.1) download the picard tools
================================
A very efficient way to transfer programs to iceberg is to download them directly from the net using the command "wget":

mkdir ~/Applications
cd ~/Applications
wget

Then we have to unzip them:

unzip picard-tools zip

The Picard tools are now ready to use.

III.2) running the "MergeSamFiles.jar" function
===============================================
cd $OLDPWD # come back to the last directory we were in before the current one (see other environment variables by typing "set | less")

The corresponding command line is (WARNING: remember to change the names of the sam files, i.e. the date is not the same):

java -jar ~/Applications/picard-tools-1.103/MergeSamFiles.jar MAX_RECORDS_IN_RAM= CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT I=~/data/ArrayJobs_Stampy/mapping_results/Lathyrus_N152_RawMapping-1_ sam I=~/data/ArrayJobs_Stampy/mapping_results/Lathyrus_N152_RawMapping-2_ sam O=~/data/ArrayJobs_Stampy/mapping_results/Lathyrus_N152_RawMapping-all_ sam

However, when we try, we obtain something like this:

Error occurred during initialization of VM
Could not reserve enough space for object heap
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

Indeed, you have to load a special module of iceberg to run java. First, you can obtain the list of special modules available on iceberg by typing:

module avail

In our case, we are interested in "apps/java/1.7", so just type:

module load apps/java/1.7

Then retry running the Picard function:

java -jar ~/Applications/picard-tools-1.103/MergeSamFiles.jar MAX_RECORDS_IN_RAM= CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT I=~/data/ArrayJobs_Stampy/mapping_results/Lathyrus_N152_RawMapping-1_ sam I=~/data/ArrayJobs_Stampy/mapping_results/Lathyrus_N152_RawMapping-2_ sam O=~/data/ArrayJobs_Stampy/mapping_results/Lathyrus_N152_RawMapping-all_ sam

Results:

[Wed Nov 27 11:40:54 GMT 2013] net.sf.picard.sam.MergeSamFiles INPUT=[/home/bo4cm17/data/ArrayJobs_Stampy/mapping_results/Lathyrus_N152_RawMapping-1_ sam, /home/bo4cm17/data/ArrayJobs_Stampy/mapping_results/Lathyrus_N152_RawMapping-2_ sam] OUTPUT=/home/bo4cm17/data/ArrayJobs_Stampy/mapping_results/Lathyrus_N152_RawMapping-all_ sam VALIDATION_STRINGENCY=LENIENT MAX_RECORDS_IN_RAM= CREATE_INDEX=true SORT_ORDER=coordinate ASSUME_SORTED=false MERGE_SEQUENCE_DICTIONARIES=false USE_THREADING=false VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 CREATE_MD5_FILE=false
[Wed Nov 27 11:40:54 GMT 2013] Executing as bo4cm17@amd-node02 on Linux el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_09-icedtea-root_2013_03_07_09_45-b00; Picard version: 1.103(1598)
INFO :40:55 MergeSamFiles Sorting input files using temp directory [/tmp/bo4cm17]
INFO :40:55 MergeSamFiles Finished reading inputs.
[Wed Nov 27 11:40:55 GMT 2013] net.sf.picard.sam.MergeSamFiles done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=

We can check that the number of reads in the resulting file is indeed the sum of those in the two previous files:

cat ~/data/ArrayJobs_Stampy/mapping_results/Lathyrus_N152_RawMapping-1_ sam | grep RG:Z:Lathyrus_N152 | wc -l # count the number of reads (each line corresponding to a read has the field "RG:Z:Lathyrus_N152")
cat ~/data/ArrayJobs_Stampy/mapping_results/Lathyrus_N152_RawMapping-2_ sam | grep RG:Z:Lathyrus_N152 | wc -l
cat ~/data/ArrayJobs_Stampy/mapping_results/Lathyrus_N152_RawMapping-all_ sam | grep RG:Z:Lathyrus_N152 | wc -l

IMPORTANT NOTE: you may sometimes need to change the default java parameters for the memory allocated to the java virtual machine in order to run it properly, e.g.:

java -Xms512m -Xmx7g -jar ~/Applications/picard-tools-1.103/MergeSamFiles.jar MAX_RECORDS_IN_RAM= CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT I=~/data/ArrayJobs_Stampy/mapping_results/Lathyrus_N152_RawMapping-1_ sam I=~/data/ArrayJobs_Stampy/mapping_results/Lathyrus_N152_RawMapping-2_ sam O=~/data/ArrayJobs_Stampy/mapping_results/Lathyrus_N152_RawMapping-all_ sam

- where Xms512m specifies that the minimum memory allocated to the java virtual machine will be 512 MB
- where Xmx7g specifies that the maximum memory allocated to the java virtual machine will be 7 GB

In this case, take care that Xmx does not exceed the memory you requested for the job in the SGE bash script. Actually, you even need to spare some RAM for other applications (well, it's a guess of mine here), i.e. if you ask for 6 GB for your job, do not let Xmx exceed 5 GB. Other Picard parameters that influence memory usage may need fine tuning for big data sets (see the Picard manual for further information).
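The rule of thumb above can be written down as a small shell helper, so that the heap size always follows the memory requested for the job. This is only a sketch: the function name xmx_for_job and the variable job_mem_gb are hypothetical, and the 1 GB of headroom is the margin suggested in the text (itself a guess), not a documented requirement.

```shell
# Hypothetical helper: derive the -Xmx flag from the memory requested for the
# job (in GB), leaving 1 GB of headroom for everything outside the java heap.
xmx_for_job() {
    requested_gb=$1
    headroom_gb=1
    # Refuse requests too small to leave any heap once the headroom is removed.
    if [ "$requested_gb" -le "$headroom_gb" ]; then
        echo "requested memory too small: ${requested_gb} GB" >&2
        return 1
    fi
    echo "-Xmx$((requested_gb - headroom_gb))g"
}

# Example: 6 GB requested in the SGE script gives a 5 GB maximum heap.
job_mem_gb=6
echo "java -Xms512m $(xmx_for_job "$job_mem_gb") -jar MergeSamFiles.jar ..."
# prints: java -Xms512m -Xmx5g -jar MergeSamFiles.jar ...
```

Setting job_mem_gb once, next to the memory request in the job script, keeps the two values from drifting apart when the request is changed later.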
